R COURSE FOR DATA ANALYSIS

Tidymodels

Tidymodels is a set of packages that handles every step needed to build a workflow for selecting and evaluating statistical learning models.

Its development is funded by RStudio and led by Max Kuhn, the lead developer of an older, similar package: caret.

Tidymodels is fully tidy-friendly, which is one of its differences from caret. It is also more complete, with many more features.

More information is available at tidymodels.org.

Orthogonal packages

Tidymodels is made up of orthogonal packages.

The term is borrowed from mathematics: given two orthogonal vectors, we can move along the direction of one without changing our projection onto the other.

In programming and software architecture, orthogonal components are said to be decoupled: a change in one component does not affect the others. This property demands smaller, more cohesive components with well-defined responsibilities, and allows changes with fewer side effects. A good book for anyone who wants to learn how to apply concepts like this in programming is The Pragmatic Programmer, by David Thomas and Andrew Hunt.

The packages that make up tidymodels work the same way: when you configure the workflow that implements the model training, selection, and evaluation process, several orthogonal steps are prepared, each handled by a different package.
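In practice, this orthogonality shows up when composing a workflow: the preprocessing recipe and the model specification are independent objects, and either can be swapped without touching the other. A minimal sketch, using the built-in mtcars dataset rather than the course data:

```r
library(tidymodels)

# A preprocessing recipe (recipes) and a model spec (parsnip) are
# built independently; workflows just glues them together
rec  <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors())

spec <- linear_reg() %>% set_engine("lm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

# Swapping the model touches nothing else -- that is the orthogonality
wf2 <- wf %>%
  update_model(rand_forest(mode = "regression") %>% set_engine("ranger"))
```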

Main packages
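The core packages and the step each one is responsible for, shown as a commented sketch:

```r
# Each core tidymodels package owns one well-defined step:
#   rsample   - data splitting and resampling (initial_split, vfold_cv)
#   recipes   - preprocessing and feature engineering (recipe, step_*)
#   parsnip   - unified model specification (linear_reg, rand_forest)
#   workflows - bundling a recipe and a model into one object
#   tune      - hyperparameter tuning (tune_grid)
#   yardstick - performance metrics (accuracy, roc_auc)
#   broom     - tidying model output (tidy, glance)
library(tidymodels)  # attaches all of the above at once
```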

Case study

The data come from a study by Columbia Business School researchers Ray Fisman and Sheena Iyengar.

They ran several rounds of four-minute dates between heterosexual men and women.

Several characteristics were collected, including a final verdict on whether each date partner liked the other.

Case study data

The data were obtained from the Kaggle website.

They are not exactly clean…

library(tidyverse)  # provides read_csv() (readr) and glimpse() (dplyr)

dados_speed_date <- read_csv("dados/speed/Speed Dating Data.csv")


glimpse(dados_speed_date)
## Rows: 8,378
## Columns: 195
## $ iid      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ gender   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ idg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ condtn   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wave     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ round    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10...
## $ position <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ positin1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ order    <dbl> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, 8, 4,...
## $ partner  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ pid      <dbl> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 14, 15...
## $ match    <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28, -0.3...
## $ samerace <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1...
## $ age_o    <dbl> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 23, 24...
## $ race_o   <dbl> 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2...
## $ pf_o_att <dbl> 35.00, 60.00, 19.00, 30.00, 30.00, 50.00, 35.00, 33.33, 50...
## $ pf_o_sin <dbl> 20.00, 0.00, 18.00, 5.00, 10.00, 0.00, 15.00, 11.11, 0.00,...
## $ pf_o_int <dbl> 20.00, 0.00, 19.00, 15.00, 20.00, 30.00, 25.00, 11.11, 25....
## $ pf_o_fun <dbl> 20.00, 40.00, 18.00, 40.00, 10.00, 10.00, 10.00, 11.11, 10...
## $ pf_o_amb <dbl> 0.00, 0.00, 14.00, 5.00, 10.00, 0.00, 5.00, 11.11, 0.00, 0...
## $ pf_o_sha <dbl> 5.00, 0.00, 12.00, 5.00, 20.00, 10.00, 10.00, 22.22, 15.00...
## $ dec_o    <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0...
## $ attr_o   <dbl> 6, 7, 10, 7, 8, 7, 3, 6, 7, 6, 8, 7, 10, 9, 10, 7, 5, 7, 8...
## $ sinc_o   <dbl> 8, 8, 10, 8, 7, 7, 6, 7, 7, 6, 7, 6, 10, 9, 10, 8, 3, 7, 6...
## $ intel_o  <dbl> 8, 10, 10, 9, 9, 8, 7, 5, 8, 6, 6, 10, 10, 9, 10, 7, 4, 7,...
## $ fun_o    <dbl> 8, 7, 10, 8, 6, 8, 5, 6, 8, 6, 9, 6, 10, 9, 10, 5, 3, 7, 9...
## $ amb_o    <dbl> 8, 7, 10, 9, 9, 7, 8, 8, 8, 6, 7, 6, 10, 9, 7, 7, 5, 7, 8,...
## $ shar_o   <dbl> 6, 5, 10, 8, 7, 7, 7, 6, 9, 6, 4, 5, 10, 9, 8, 7, 3, 5, 7,...
## $ like_o   <dbl> 7.0, 8.0, 10.0, 7.0, 8.0, 7.0, 2.0, 7.0, 6.5, 6.0, 7.0, 8....
## $ prob_o   <dbl> 4, 4, 10, 7, 6, 6, 1, 5, 8, 6, 2, 4, 10, 7, 1, 5, 3, 6, 8,...
## $ met_o    <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2...
## $ age      <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24...
## $ field    <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law", "L...
## $ field_cd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mn_sat   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tuition  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ race     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprace  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprelig <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ from     <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Ch...
## $ zipcode  <dbl> 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60...
## $ income   <dbl> 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69...
## $ goal     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ go_out   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ career   <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "lawyer"...
## $ career_c <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sports   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ dining   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, ...
## $ museums  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ art      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hiking   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ gaming   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ reading  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10, 10, ...
## $ tv       <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ theater  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ movies   <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 8, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 7, ...
## $ music    <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ yoga     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ exphappy <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ expnum   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ attr1_1  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 45, 45, 45, 45, 45...
## $ sinc1_1  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ fun1_1   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20...
## $ amb1_1   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_1  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 5, 5, 5, 5, 5, ...
## $ attr4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_1  <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 65, 65, 65, 65, 65...
## $ sinc2_1  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 0, 0, 0, 0, 0, 0, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ fun2_1   <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ amb2_1   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ shar2_1  <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attr3_1  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_1  <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ fun3_1   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ amb3_1   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr5_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_1  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_1   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dec      <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1...
## $ attr     <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5, 7, 6...
## $ sinc     <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8, 6, 7...
## $ intel    <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7, 8, 8...
## $ fun      <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5, 9, 7...
## $ amb      <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4, ...
## $ shar     <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5, 5, 8...
## $ like     <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5, 8...
## $ prob     <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6, 6, ...
## $ met      <dbl> 2, 1, 1, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun1_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb1_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar1_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc3_s  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb3_s   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ satis_2  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ length   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ numdat_2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, ...
## $ attr7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_2  <dbl> 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19...
## $ sinc1_2  <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13...
## $ fun1_2   <dbl> 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22...
## $ amb1_2   <dbl> 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11...
## $ shar1_2  <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ attr4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_2  <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_2  <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ fun3_2   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_2   <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_2  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_2   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ you_call <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ them_cal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ date_3   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_3  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30...
## $ sinc1_3  <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 40, 40, 40, 40, 40...
## $ fun1_3   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15...
## $ amb1_3   <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_3  <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ attr7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_3  <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_3  <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ fun3_3   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_3   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_3  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_3   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
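Since the data are messy, a quick way to quantify the missingness visible in the glimpse above is to count NAs per column. A base-R sketch, shown on a toy data frame; the same call applies to dados_speed_date:

```r
# Toy data frame standing in for dados_speed_date
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 1:3)

# NAs per column, worst offenders first
sort(colSums(is.na(df)), decreasing = TRUE)
# -> b: 2, a: 1, c: 0

# On the real data:
# sort(colSums(is.na(dados_speed_date)), decreasing = TRUE)
```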

Renaming column by column

Some columns should be renamed with more intelligible names.

dados_speed_date_renomeado <- dados_speed_date %>% 
  rename(
    unique_id_number = iid,
    id_within_wave = id,
    male = gender,
    subject_within_gender = idg,
    choice = condtn,
    n_people_met_in_wave = round,
    position_meeting = position,
    position_started = positin1,
    order_meeting = order,
    partnet_id_within_wave = partner,
    partner_unique_id_number = pid,
    interests_correlation = int_corr,
    same_race = samerace,
    my_age = age,
    partner_age = age_o,
    partner_race = race_o,
    partner_stated_pref_time0_attractive = pf_o_att,
    partner_stated_pref_time0_sincere = pf_o_sin,
    partner_stated_pref_time0_intelligent = pf_o_int,
    partner_stated_pref_time0_fun = pf_o_fun,
    partner_stated_pref_time0_ambitious = pf_o_amb,
    partner_stated_pref_time0_shared_interests = pf_o_sha,
    cod_field = field_cd,
    importance_same_race = imprace,
    importance_same_religion = imprelig,
    place_from = from,
    income_zipcode = income,
    frequency_date = date,
    frequency_go_out = go_out,
    career_macro = career_c,
    happy_expec = exphappy,
    n_expect_like_you = expnum,
    i_liked_partner = dec,
    partner_liked_me = dec_o,
    i_found_partner__attractive = attr,
    i_found_partner__sincere = sinc,
    i_found_partner__intelligent = intel,
    i_found_partner__fun = fun,
    i_found_partner__ambitious = amb,
    i_found_partner__interests = shar,
    degree_i_liked_partner = like,
    partner_found_me__attractive = attr_o,
    partner_found_me__sincere = sinc_o,
    partner_found_me__intelligent = intel_o,
    partner_found_me__fun = fun_o,
    partner_found_me__ambitious = amb_o,
    partner_found_me__interests = shar_o,
    probability_i_find_partner_liked_me = prob,
    met_before = met,
    n_matches_you_think = match_es,
    satisfaction_with_partners = satis_2,
    opinion_duration_of_date = length,
    opinion_num_dates = numdat_2,
    num_matches_you_called = you_call,
    num_matches_called_you = them_cal,
    have_you_dated = date_3
  )

glimpse(dados_speed_date_renomeado)
## Rows: 8,378
## Columns: 195
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ male                                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ wave                                       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ match                                      <dbl> 0, 0, 1, 1, 1, 0, 0, 0, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <dbl> 2, 2, 4, 2, 3, 2, 2, 2, ...
## $ partner_stated_pref_time0_attractive       <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere          <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent      <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun              <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ partner_liked_me                           <dbl> 0, 0, 1, 1, 1, 1, 0, 0, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ like_o                                     <dbl> 7.0, 8.0, 10.0, 7.0, 8.0...
## $ prob_o                                     <dbl> 4, 4, 10, 7, 6, 6, 1, 5,...
## $ met_o                                      <dbl> 2, 2, 1, 2, 2, 2, 2, 2, ...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ field                                      <chr> "Law", "Law", "Law", "La...
## $ cod_field                                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ undergra                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ mn_sat                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ tuition                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ place_from                                 <chr> "Chicago", "Chicago", "C...
## $ zipcode                                    <dbl> 60521, 60521, 60521, 605...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ goal                                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ frequency_date                             <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ frequency_go_out                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ career                                     <chr> "lawyer", "lawyer", "law...
## $ career_macro                               <dbl> NA, NA, NA, NA, NA, NA, ...
## $ sports                                     <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ tvsports                                   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ exercise                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ dining                                     <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ museums                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ art                                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ hiking                                     <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ gaming                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ clubbing                                   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ reading                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ tv                                         <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ theater                                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ movies                                     <dbl> 10, 10, 10, 10, 10, 10, ...
## $ concerts                                   <dbl> 10, 10, 10, 10, 10, 10, ...
## $ music                                      <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ shopping                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ yoga                                       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ attr1_1                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_1                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_1                                   <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_1                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_1                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_1                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_1                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_1                                    <dbl> 35, 35, 35, 35, 35, 35, ...
## $ sinc2_1                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel2_1                                   <dbl> 15, 15, 15, 15, 15, 15, ...
## $ fun2_1                                     <dbl> 20, 20, 20, 20, 20, 20, ...
## $ amb2_1                                     <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ shar2_1                                    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ attr3_1                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_1                                    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_1                                     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ intel3_1                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ amb3_1                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_1                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_1                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_1                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ i_liked_partner                            <dbl> 1, 1, 1, 1, 1, 0, 1, 0, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ degree_i_liked_partner                     <dbl> 7, 7, 7, 7, 6, 6, 6, 6, ...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <dbl> 2, 1, 1, 2, 2, 2, 2, NA,...
## $ n_matches_you_think                        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ attr1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel1_s                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun1_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb1_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar1_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc3_s                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel3_s                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun3_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb3_s                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ satisfaction_with_partners                 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ opinion_duration_of_date                   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ opinion_num_dates                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ attr7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_2                                    <dbl> 19.44, 19.44, 19.44, 19....
## $ sinc1_2                                    <dbl> 16.67, 16.67, 16.67, 16....
## $ intel1_2                                   <dbl> 13.89, 13.89, 13.89, 13....
## $ fun1_2                                     <dbl> 22.22, 22.22, 22.22, 22....
## $ amb1_2                                     <dbl> 11.11, 11.11, 11.11, 11....
## $ shar1_2                                    <dbl> 16.67, 16.67, 16.67, 16....
## $ attr4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_2                                    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_2                                    <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_2                                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_2                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_2                                     <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ attr5_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_2                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_2                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_2                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_matches_you_called                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ num_matches_called_you                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ have_you_dated                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ numdat_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_in_3                                   <dbl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_3                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_3                                    <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_3                                   <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_3                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_3                                     <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_3                                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_3                                    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ sinc3_3                                    <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_3                                   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ fun3_3                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_3                                     <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_3                                    <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_3                                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_3                                     <lgl> NA, NA, NA, NA, NA, NA, ...

Batch-renaming mysterious column sets

There are still columns with cryptic suffixes, such as 1_1.

adjust_column_feature <- function(x, suffix, meaning) {

  # Strip the cryptic suffix (e.g. "4_1") from the column name.
  suffix_removed <- str_remove(string = x, pattern = suffix)

  # Translate the attribute stem into a descriptive label.
  type <- case_when(
    suffix_removed == "attr"  ~ "attractive",
    suffix_removed == "sinc"  ~ "sincere",
    suffix_removed == "intel" ~ "intelligent",
    suffix_removed == "fun"   ~ "fun",
    suffix_removed == "amb"   ~ "ambitious",
    suffix_removed == "shar"  ~ "shared_interests"
  )

  # The meaning prefixes passed below already end in "_", so the glued
  # names carry a double underscore, e.g. "competitors_look_for__attractive".
  str_glue("{meaning}_{type}")
}
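As a quick sanity check, the same logic can be sketched in base R (the helper name `adjust_column_feature_base` is hypothetical; for literal suffixes it behaves like the stringr version above):

```r
# Base-R sketch of adjust_column_feature(), assuming the suffix is a
# literal string (no regex metacharacters).
adjust_column_feature_base <- function(x, suffix, meaning) {
  stem <- sub(suffix, "", x, fixed = TRUE)  # drop the literal suffix
  types <- c(
    attr = "attractive", sinc = "sincere", intel = "intelligent",
    fun = "fun", amb = "ambitious", shar = "shared_interests"
  )
  # meaning already ends in "_", hence the double underscore in the result
  sprintf("%s_%s", meaning, types[[stem]])
}

adjust_column_feature_base("attr4_1", "4_1", "competitors_look_for_")
# "competitors_look_for__attractive"
```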




dados_speed_date_rename_with <- dados_speed_date_renomeado %>% 
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_1", meaning = "competitors_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_2", meaning = "competitors_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "4_3", meaning = "competitors_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_1", meaning = "you_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_s"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_s", meaning = "you_look_for_half_way_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_2", meaning = "you_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "1_3", meaning = "you_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_1", meaning = "you_think_opposite_sex_look_for_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_2", meaning = "you_think_opposite_sex_look_for_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "2_3", meaning = "you_think_opposite_sex_look_for_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_1", meaning = "others_perceive_you_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_2", meaning = "others_perceive_you_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "5_3", meaning = "others_perceive_you_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_1"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_1", meaning = "you_perceive_yourself_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_2", meaning = "you_perceive_yourself_follow_up_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_s"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_s", meaning = "you_perceive_yourself_half_way_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "3_3", meaning = "you_perceive_yourself_follow_up_weeks_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_2"),
    .fn = ~adjust_column_feature(x = .x, suffix = "7_2", meaning = "actual_importance_")
  ) %>%
  rename_with(
    .cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_3"),
    .fn = ~adjust_column_feature(x = .x, suffix = "7_3", meaning = "actual_importance_follow_up_weeks_")
  ) %>%
  select(  
    -c(
      undergra,
      mn_sat,
      tuition
    )
  )  
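Each rename_with() call above pairs one regular expression with one suffix/meaning combination. The selection part can be inspected in isolation with base R's grepl() (toy names below, not the real data):

```r
# Toy column names standing in for the real ones.
nms <- c("attr4_1", "sinc4_1", "fun1_2", "wave", "shar3_3")

# Same pattern style as the matches() calls above:
# an attribute stem followed by the target suffix.
sel <- grepl("^(?:attr|sinc|intel|fun|amb|shar)4_1", nms, perl = TRUE)
nms[sel]
# "attr4_1" "sinc4_1"
```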

glimpse(dados_speed_date_rename_with)
## Rows: 8,378
## Columns: 192
## $ unique_id_number                                                  <dbl> 1...
## $ id_within_wave                                                    <dbl> 1...
## $ male                                                              <dbl> 0...
## $ subject_within_gender                                             <dbl> 1...
## $ choice                                                            <dbl> 1...
## $ wave                                                              <dbl> 1...
## $ n_people_met_in_wave                                              <dbl> 1...
## $ position_meeting                                                  <dbl> 7...
## $ position_started                                                  <lgl> N...
## $ order_meeting                                                     <dbl> 4...
## $ partnet_id_within_wave                                            <dbl> 1...
## $ partner_unique_id_number                                          <dbl> 1...
## $ match                                                             <dbl> 0...
## $ interests_correlation                                             <dbl> 0...
## $ same_race                                                         <dbl> 0...
## $ partner_age                                                       <dbl> 2...
## $ partner_race                                                      <dbl> 2...
## $ partner_stated_pref_time0_attractive                              <dbl> 3...
## $ partner_stated_pref_time0_sincere                                 <dbl> 2...
## $ partner_stated_pref_time0_intelligent                             <dbl> 2...
## $ partner_stated_pref_time0_fun                                     <dbl> 2...
## $ partner_stated_pref_time0_ambitious                               <dbl> 0...
## $ partner_stated_pref_time0_shared_interests                        <dbl> 5...
## $ partner_liked_me                                                  <dbl> 0...
## $ partner_found_me__attractive                                      <dbl> 6...
## $ partner_found_me__sincere                                         <dbl> 8...
## $ partner_found_me__intelligent                                     <dbl> 8...
## $ partner_found_me__fun                                             <dbl> 8...
## $ partner_found_me__ambitious                                       <dbl> 8...
## $ partner_found_me__interests                                       <dbl> 6...
## $ like_o                                                            <dbl> 7...
## $ prob_o                                                            <dbl> 4...
## $ met_o                                                             <dbl> 2...
## $ my_age                                                            <dbl> 2...
## $ field                                                             <chr> "...
## $ cod_field                                                         <dbl> 1...
## $ race                                                              <dbl> 4...
## $ importance_same_race                                              <dbl> 2...
## $ importance_same_religion                                          <dbl> 4...
## $ place_from                                                        <chr> "...
## $ zipcode                                                           <dbl> 6...
## $ income_zipcode                                                    <dbl> 6...
## $ goal                                                              <dbl> 2...
## $ frequency_date                                                    <dbl> 7...
## $ frequency_go_out                                                  <dbl> 1...
## $ career                                                            <chr> "...
## $ career_macro                                                      <dbl> N...
## $ sports                                                            <dbl> 9...
## $ tvsports                                                          <dbl> 2...
## $ exercise                                                          <dbl> 8...
## $ dining                                                            <dbl> 9...
## $ museums                                                           <dbl> 1...
## $ art                                                               <dbl> 1...
## $ hiking                                                            <dbl> 5...
## $ gaming                                                            <dbl> 1...
## $ clubbing                                                          <dbl> 5...
## $ reading                                                           <dbl> 6...
## $ tv                                                                <dbl> 9...
## $ theater                                                           <dbl> 1...
## $ movies                                                            <dbl> 1...
## $ concerts                                                          <dbl> 1...
## $ music                                                             <dbl> 9...
## $ shopping                                                          <dbl> 8...
## $ yoga                                                              <dbl> 1...
## $ happy_expec                                                       <dbl> 3...
## $ n_expect_like_you                                                 <dbl> 2...
## $ you_look_for__attractive                                          <dbl> 1...
## $ you_look_for__sincere                                             <dbl> 2...
## $ you_look_for__intelligent                                         <dbl> 2...
## $ you_look_for__fun                                                 <dbl> 1...
## $ you_look_for__ambitious                                           <dbl> 1...
## $ you_look_for__shared_interests                                    <dbl> 1...
## $ competitors_look_for__attractive                                  <lgl> N...
## $ competitors_look_for__sincere                                     <lgl> N...
## $ competitors_look_for__intelligent                                 <lgl> N...
## $ competitors_look_for__fun                                         <lgl> N...
## $ competitors_look_for__ambitious                                   <lgl> N...
## $ competitors_look_for__shared_interests                            <lgl> N...
## $ you_think_opposite_sex_look_for__attractive                       <dbl> 3...
## $ you_think_opposite_sex_look_for__sincere                          <dbl> 2...
## $ you_think_opposite_sex_look_for__intelligent                      <dbl> 1...
## $ you_think_opposite_sex_look_for__fun                              <dbl> 2...
## $ you_think_opposite_sex_look_for__ambitious                        <dbl> 5...
## $ you_think_opposite_sex_look_for__shared_interests                 <dbl> 5...
## $ you_perceive_yourself__attractive                                 <dbl> 6...
## $ you_perceive_yourself__sincere                                    <dbl> 8...
## $ you_perceive_yourself__fun                                        <dbl> 8...
## $ you_perceive_yourself__intelligent                                <dbl> 8...
## $ you_perceive_yourself__ambitious                                  <dbl> 7...
## $ others_perceive_you__attractive                                   <lgl> N...
## $ others_perceive_you__sincere                                      <lgl> N...
## $ others_perceive_you__intelligent                                  <lgl> N...
## $ others_perceive_you__fun                                          <lgl> N...
## $ others_perceive_you__ambitious                                    <lgl> N...
## $ i_liked_partner                                                   <dbl> 1...
## $ i_found_partner__attractive                                       <dbl> 6...
## $ i_found_partner__sincere                                          <dbl> 9...
## $ i_found_partner__intelligent                                      <dbl> 7...
## $ i_found_partner__fun                                              <dbl> 7...
## $ i_found_partner__ambitious                                        <dbl> 6...
## $ i_found_partner__interests                                        <dbl> 5...
## $ degree_i_liked_partner                                            <dbl> 7...
## $ probability_i_find_partner_liked_me                               <dbl> 6...
## $ met_before                                                        <dbl> 2...
## $ n_matches_you_think                                               <dbl> 4...
## $ you_look_for_half_way__attractive                                 <lgl> N...
## $ you_look_for_half_way__sincere                                    <lgl> N...
## $ you_look_for_half_way__intelligent                                <lgl> N...
## $ you_look_for_half_way__fun                                        <lgl> N...
## $ you_look_for_half_way__ambitious                                  <lgl> N...
## $ you_look_for_half_way__shared_interests                           <lgl> N...
## $ you_perceive_yourself_half_way__attractive                        <lgl> N...
## $ you_perceive_yourself_half_way__sincere                           <lgl> N...
## $ you_perceive_yourself_half_way__intelligent                       <lgl> N...
## $ you_perceive_yourself_half_way__fun                               <lgl> N...
## $ you_perceive_yourself_half_way__ambitious                         <lgl> N...
## $ satisfaction_with_partners                                        <dbl> 6...
## $ opinion_duration_of_date                                          <dbl> 2...
## $ opinion_num_dates                                                 <dbl> 1...
## $ actual_importance__attractive                                     <lgl> N...
## $ actual_importance__sincere                                        <lgl> N...
## $ actual_importance__intelligent                                    <lgl> N...
## $ actual_importance__fun                                            <lgl> N...
## $ actual_importance__ambitious                                      <lgl> N...
## $ actual_importance__shared_interests                               <lgl> N...
## $ you_look_for_follow_up__attractive                                <dbl> 1...
## $ you_look_for_follow_up__sincere                                   <dbl> 1...
## $ you_look_for_follow_up__intelligent                               <dbl> 1...
## $ you_look_for_follow_up__fun                                       <dbl> 2...
## $ you_look_for_follow_up__ambitious                                 <dbl> 1...
## $ you_look_for_follow_up__shared_interests                          <dbl> 1...
## $ competitors_look_for_follow_up__attractive                        <lgl> N...
## $ competitors_look_for_follow_up__sincere                           <lgl> N...
## $ competitors_look_for_follow_up__intelligent                       <lgl> N...
## $ competitors_look_for_follow_up__fun                               <lgl> N...
## $ competitors_look_for_follow_up__ambitious                         <lgl> N...
## $ competitors_look_for_follow_up__shared_interests                  <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__attractive             <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__sincere                <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__intelligent            <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__fun                    <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__ambitious              <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__shared_interests       <lgl> N...
## $ you_perceive_yourself_follow_up__attractive                       <dbl> 6...
## $ you_perceive_yourself_follow_up__sincere                          <dbl> 7...
## $ you_perceive_yourself_follow_up__intelligent                      <dbl> 8...
## $ you_perceive_yourself_follow_up__fun                              <dbl> 7...
## $ you_perceive_yourself_follow_up__ambitious                        <dbl> 6...
## $ others_perceive_you_follow_up__attractive                         <lgl> N...
## $ others_perceive_you_follow_up__sincere                            <lgl> N...
## $ others_perceive_you_follow_up__intelligent                        <lgl> N...
## $ others_perceive_you_follow_up__fun                                <lgl> N...
## $ others_perceive_you_follow_up__ambitious                          <lgl> N...
## $ num_matches_you_called                                            <dbl> 1...
## $ num_matches_called_you                                            <dbl> 1...
## $ have_you_dated                                                    <dbl> 0...
## $ numdat_3                                                          <lgl> N...
## $ num_in_3                                                          <dbl> N...
## $ you_look_for_follow_up_weeks__attractive                          <dbl> 1...
## $ you_look_for_follow_up_weeks__sincere                             <dbl> 2...
## $ you_look_for_follow_up_weeks__intelligent                         <dbl> 2...
## $ you_look_for_follow_up_weeks__fun                                 <dbl> 1...
## $ you_look_for_follow_up_weeks__ambitious                           <dbl> 1...
## $ you_look_for_follow_up_weeks__shared_interests                    <dbl> 1...
## $ actual_importance_follow_up_weeks__attractive                     <lgl> N...
## $ actual_importance_follow_up_weeks__sincere                        <lgl> N...
## $ actual_importance_follow_up_weeks__intelligent                    <lgl> N...
## $ actual_importance_follow_up_weeks__fun                            <lgl> N...
## $ actual_importance_follow_up_weeks__ambitious                      <lgl> N...
## $ actual_importance_follow_up_weeks__shared_interests               <lgl> N...
## $ competitors_look_for_follow_up_weeks__attractive                  <lgl> N...
## $ competitors_look_for_follow_up_weeks__sincere                     <lgl> N...
## $ competitors_look_for_follow_up_weeks__intelligent                 <lgl> N...
## $ competitors_look_for_follow_up_weeks__fun                         <lgl> N...
## $ competitors_look_for_follow_up_weeks__ambitious                   <lgl> N...
## $ competitors_look_for_follow_up_weeks__shared_interests            <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__attractive       <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__sincere          <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__intelligent      <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__fun              <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__ambitious        <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up_weeks__attractive                 <dbl> 5...
## $ you_perceive_yourself_follow_up_weeks__sincere                    <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__intelligent                <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__fun                        <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__ambitious                  <dbl> 7...
## $ others_perceive_you_follow_up_weeks__attractive                   <lgl> N...
## $ others_perceive_you_follow_up_weeks__sincere                      <lgl> N...
## $ others_perceive_you_follow_up_weeks__intelligent                  <lgl> N...
## $ others_perceive_you_follow_up_weeks__fun                          <lgl> N...
## $ others_perceive_you_follow_up_weeks__ambitious                    <lgl> N...

Translating codes into factors with strings

Many attributes are coded numerically, which hampers interpretation, so they were converted to character vectors.

This kind of numeric coding is very common in off-the-shelf, point-and-click statistical software.

Note that the attribute frequency_date was turned into a factor whose levels have a specific order. This will have implications later on.
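To make that "specific order" concrete: with an ordered factor, comparison operators respect the declared level order rather than alphabetical order. A minimal sketch with made-up values:

```r
# Levels run from most frequent to least frequent, as in the recoding below.
freq <- factor(
  c("Never", "Once a week", "Twice a week"),
  levels = c(
    "Several a week", "Twice a week", "Once a week",
    "Twice a month", "Once a month", "Several a year", "Never"
  ),
  ordered = TRUE
)

freq[2] < freq[1]  # TRUE: "Once a week" precedes "Never" in the level order
```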

dados_speed_date_fatores <- dados_speed_date_rename_with %>%
  mutate(
    choice = if_else(choice == 1, "limited", "extensive"),
    field_factor = case_when(
      cod_field == 1 ~ "Law",
      cod_field == 2 ~ "Math",
      cod_field == 3 ~ "Social Science, Psychologist" ,
      cod_field == 4 ~ "Medical Science, Pharmaceuticals, and Bio Tech" ,
      cod_field == 5 ~ "Engineering"  ,
      cod_field == 6 ~ "English/Creative Writing/ Journalism" ,
      cod_field == 7 ~ "History/Religion/Philosophy" ,
      cod_field == 8 ~ "Business/Econ/Finance" ,
      cod_field == 9 ~ "Education, Academia" ,
      cod_field == 10 ~ "Biological Sciences/Chemistry/Physics",
      cod_field == 11 ~ "Social Work" ,
      cod_field == 12 ~ "Undergrad/undecided" ,
      cod_field == 13 ~ "Political Science/International Affairs" ,
      cod_field == 14 ~ "Film",
      cod_field == 15 ~ "Fine Arts/Arts Administration",
      cod_field == 16 ~ "Languages",
      cod_field == 17 ~ "Architecture",
      cod_field == 18 ~ "Other"
    ),
    
    race = case_when(
      race == 1 ~ "Black",
      race == 2 ~ "White",
      race == 3 ~ "Latino", 
      race == 4 ~ "Asian" ,
      race == 5 ~ "Native American"  ,
      race == 6 ~ "Others" 
    ),
    
    partner_race = case_when(
      partner_race == 1 ~ "Black",
      partner_race == 2 ~ "White",
      partner_race == 3 ~ "Latino", 
      partner_race == 4 ~ "Asian" ,
      partner_race == 5 ~ "Native American"  ,
      partner_race == 6 ~ "Others" 
    ),
    
    goal = case_when(
      goal == 1 ~ "Fun",
      goal == 2 ~ "Meet new people",
      goal == 3 ~ "Date",
      goal == 4 ~ "Serious",
      goal == 5 ~ "To say",
      goal == 6 ~ "Other"
    ),
    
    # Keep the original numeric code before recoding.
    cod_frequency_date = frequency_date,

    frequency_date =
      case_when(
        frequency_date == 1 ~ "Several a week",
        frequency_date == 2 ~ "Twice a week",
        frequency_date == 3 ~ "Once a week",
        frequency_date == 4 ~ "Twice a month",
        frequency_date == 5 ~ "Once a month",
        frequency_date == 6 ~ "Several a year",
        frequency_date == 7 ~ "Never"
      ) %>% 
      factor(
        levels = c(
          "Several a week",
          "Twice a week",
          "Once a week",
          "Twice a month",
          "Once a month",
          "Several a year",
          "Never"
        ),
        ordered = TRUE
      ) 
    ,
    
    
    frequency_go_out =
      case_when(
        frequency_go_out == 1 ~ "Several a week",
        frequency_go_out == 2 ~ "Twice a week",
        frequency_go_out == 3 ~ "Once a week",
        frequency_go_out == 4 ~ "Twice a month",
        frequency_go_out == 5 ~ "Once a month",
        frequency_go_out == 6 ~ "Several a year",
        frequency_go_out == 7 ~ "Never"
      ) %>%
      factor(
        levels = c(
          "Several a week",
          "Twice a week",
          "Once a week",
          "Twice a month",
          "Once a month",
          "Several a year",
          "Never"
        ),
        ordered = TRUE
      ) ,

    
    
    career = str_to_title(career),

    career_macro = 
      case_when(
        career_macro == 1 ~ "Lawyer",
        career_macro == 2 ~ "Academic/Research",
        career_macro == 3 ~ "Psychologist" ,
        career_macro == 4 ~ "Doctor/Medicine" ,
        career_macro == 5 ~ "Engineer" ,
        career_macro == 6 ~ "Creative Arts/Entertainment" ,
        career_macro == 7 ~ "Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin" ,
        career_macro == 8 ~ "Real Estate" ,
        career_macro == 9 ~ "International/Humanitarian Affairs" ,
        career_macro == 10 ~ "Undecided" ,
        career_macro == 11 ~ "Social Work",
        career_macro == 12 ~ "Speech Pathology",
        career_macro == 13 ~ "Politics",
        career_macro == 14 ~ "Pro sports/Athletics",
        career_macro == 15 ~ "Other",
        career_macro == 16 ~ "Journalism",
        career_macro == 17 ~ "Architecture"
    ),
    
    met_before = if_else(met_before == 1, TRUE, FALSE),
    
    opinion_duration_of_date = case_when(
      opinion_duration_of_date == 1 ~ "Too little",
      opinion_duration_of_date == 2 ~ "Too much",
      opinion_duration_of_date == 3 ~ "Just Right"
    ),
    
    opinion_num_dates = case_when(
      opinion_num_dates == 1 ~ "Too few",
      opinion_num_dates == 2 ~ "Too many"
    ),
    
    have_you_dated = case_when(
      have_you_dated == 1 ~ TRUE,
      have_you_dated == 2 ~ FALSE
    ),

    sex = if_else(male > 0, "Homem", "Mulher") %>% as_factor()
  ) %>% 
  select(
    match,
    unique_id_number,
    id_within_wave,
    sex,
    subject_within_gender,
    choice,
    n_people_met_in_wave,
    position_meeting,
    position_started,
    order_meeting,
    partnet_id_within_wave,
    partner_unique_id_number,
    interests_correlation,
    same_race,
    my_age,
    partner_age,
    partner_race,
    partner_stated_pref_time0_attractive,
    partner_stated_pref_time0_sincere,
    partner_stated_pref_time0_intelligent,
    partner_stated_pref_time0_fun,
    partner_stated_pref_time0_ambitious,
    partner_stated_pref_time0_shared_interests,
    importance_same_race,
    importance_same_religion,
    income_zipcode,
    frequency_date,
    frequency_go_out,
    career_macro,
    happy_expec,
    n_expect_like_you,
    partner_liked_me,
    i_liked_partner,
    i_found_partner__attractive,
    i_found_partner__sincere,
    i_found_partner__intelligent,
    i_found_partner__fun,
    i_found_partner__ambitious,
    i_found_partner__interests,
    partner_found_me__attractive,
    partner_found_me__sincere,
    partner_found_me__intelligent,
    partner_found_me__fun,
    partner_found_me__ambitious,
    partner_found_me__interests,
    probability_i_find_partner_liked_me,
    met_before,
    opinion_duration_of_date,
    opinion_num_dates,
    starts_with("competitors_look_for__"),
    starts_with("you_look_for__"),
    starts_with("you_think_opposite_sex_look_for__"),
    starts_with("others_perceive_you__"),
    starts_with("you_perceive_yourself__"),
    starts_with("actual_importance__"),
    race,
    goal
  ) %>%
  mutate(
    across(
      .cols = where(is.character),
      .fns = as.factor
    )
  ) %>% 
  mutate(
    across(
      .cols = c(match, same_race, partner_liked_me, i_liked_partner) ,
      .fns = as.logical
    )
  ) 
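The two closing mutate(across(...)) calls apply a single function to many columns at once: character columns become factors, and the 0/1 indicator columns become logicals. On toy vectors the conversions look like this:

```r
# character -> factor, as across(where(is.character), as.factor) does per column
f <- as.factor(c("Law", "Math", "Law"))
levels(f)
# "Law" "Math"

# 0/1 doubles -> logical, as across(c(match, ...), as.logical) does
as.logical(c(0, 1, 1, 0))
# FALSE  TRUE  TRUE FALSE
```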


glimpse(dados_speed_date_fatores)
## Rows: 8,378
## Columns: 79
## $ match                                      <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <fct> limited, limited, limite...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive       <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere          <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent      <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun              <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ frequency_date                             <ord> Never, Never, Never, Nev...
## $ frequency_go_out                           <ord> Several a week, Several ...
## $ career_macro                               <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me                           <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner                            <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date                   <fct> Too much, Too much, Too ...
## $ opinion_num_dates                          <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent          <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun                  <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive                   <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__sincere                      <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__intelligent                  <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__fun                          <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__ambitious                    <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__shared_interests             <dbl> 15, 15, 15, 15, 15, 15, ...
## $ others_perceive_you__attractive            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive          <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun                 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent         <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere                 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests        <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <fct> Asian, Asian, Asian, Asi...
## $ goal                                       <fct> Meet new people, Meet ne...

Reconciling different normalizations

Some questions were asked inconsistently across the days of the study.

On some days respondents were given a budget of x points to distribute across the attributes of the same type; on other days each attribute had its own budget.

normaliza_no_prefixo <- function(
  df = dados_com_representacao,
  prefixo = "partner_stated_pref_time0_") {
  
  df %>%
    rowwise() %>% 
    # total points this respondent distributed across the prefixed columns
    mutate(
      "{prefixo}_soma" := 
        sum(c_across(starts_with(prefixo)), na.rm = TRUE)
    ) %>% 
    # divide each prefixed column by the row total, so the block sums to 1
    mutate(
      across(
        .cols = starts_with(prefixo),
        .fns = ~ .x / .data[[str_glue("{prefixo}_soma")]]
      )
    ) %>% 
    # drop the helper column
    select(
      -contains(str_glue("{prefixo}_soma"))
    )
}

dados_speed_date_normalizada <- dados_speed_date_fatores %>%  
  normaliza_no_prefixo("partner_stated_pref_time0_" ) %>% 
  normaliza_no_prefixo("you_look_for__" ) %>% 
  normaliza_no_prefixo("opposite_sex_look_for__" ) %>% 
  ungroup()
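The row-wise budget trick above can be sanity-checked on toy data. A minimal sketch, with made-up column names (`pref__a` etc.) standing in for the real prefixed attributes:

```r
library(dplyr)

# Toy data: two respondents who distributed points under different budgets
toy <- tibble(
  pref__a = c(35, 60),
  pref__b = c(20, 0),
  pref__c = c(45, 40)
)

toy_norm <- toy %>%
  rowwise() %>%
  # total points each respondent handed out
  mutate(soma = sum(c_across(starts_with("pref__")), na.rm = TRUE)) %>%
  # divide every prefixed column by the row total
  mutate(across(starts_with("pref__"), ~ .x / soma)) %>%
  ungroup() %>%
  select(-soma)

rowSums(toy_norm)  # each row now sums to 1
```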


glimpse(dados_speed_date_normalizada)
## Rows: 8,378
## Columns: 79
## $ match                                      <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave                             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice                                     <fct> limited, limited, limite...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started                           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting                              <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number                   <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation                      <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race                                  <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age                                     <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age                                <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race                               <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive       <dbl> 0.3500000, 0.6000000, 0....
## $ partner_stated_pref_time0_sincere          <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_intelligent      <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_fun              <dbl> 0.2000000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.0000000, 0.0000000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.0500000, 0.0000000, 0....
## $ importance_same_race                       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode                             <dbl> 69487, 69487, 69487, 694...
## $ frequency_date                             <ord> Never, Never, Never, Nev...
## $ frequency_go_out                           <ord> Several a week, Several ...
## $ career_macro                               <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec                                <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me                           <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner                            <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive                <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere                   <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent               <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun                       <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious                 <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests                 <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive               <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere                  <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent              <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun                      <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious                <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests                <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me        <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before                                 <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date                   <fct> Too much, Too much, Too ...
## $ opinion_num_dates                          <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent          <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun                  <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive                   <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__sincere                      <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent                  <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun                          <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__ambitious                    <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__shared_interests             <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ others_perceive_you__attractive            <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent           <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun                   <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive          <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun                 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent         <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious           <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive              <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere                 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent             <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun                     <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious               <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests        <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race                                       <fct> Asian, Asian, Asian, Asi...
## $ goal                                       <fct> Meet new people, Meet ne...

Statistics from the data summary

The skimr library, through its skim() function, offers a good way to see a summary of the data's characteristics.

skim(dados_speed_date_normalizada)
Data summary
Name dados_speed_date_normaliz…
Number of rows 8378
Number of columns 79
_______________________
Column type frequency:
factor 10
logical 23
numeric 46
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1.00 FALSE 2 Hom: 4194, Mul: 4184
choice 0 1.00 FALSE 2 ext: 6944, lim: 1434
partner_race 73 0.99 FALSE 5 Whi: 4722, Asi: 1978, Lat: 664, Oth: 521
frequency_date 97 0.99 TRUE 7 Sev: 2094, Twi: 2040, Onc: 1528, Nev: 1434
frequency_go_out 2778 0.67 TRUE 2 Twi: 2990, Sev: 2610, Onc: 0, Twi: 0
career_macro 138 0.98 FALSE 17 Aca: 2320, Ban: 2170, Cre: 724, Law: 675
opinion_duration_of_date 915 0.89 FALSE 3 Too: 4227, Jus: 3059, Too: 177
opinion_num_dates 4107 0.51 FALSE 2 Too: 3622, Too: 649
race 63 0.99 FALSE 5 Whi: 4727, Asi: 1982, Lat: 664, Oth: 522
goal 79 0.99 FALSE 6 Fun: 3426, Mee: 3012, Dat: 631, To : 510

Variable type: logical

skim_variable n_missing complete_rate mean count
match 0 1.00 0.16 FAL: 6998, TRU: 1380
position_started 7974 0.05 1.00 TRU: 404
same_race 0 1.00 0.40 FAL: 5062, TRU: 3316
partner_liked_me 0 1.00 0.42 FAL: 4863, TRU: 3515
i_liked_partner 0 1.00 0.42 FAL: 4860, TRU: 3518
met_before 375 0.96 0.04 FAL: 7652, TRU: 351
competitors_look_for__attractive 8378 0.00 NaN :
competitors_look_for__sincere 7997 0.05 0.05 FAL: 363, TRU: 18
competitors_look_for__intelligent 8204 0.02 0.28 FAL: 125, TRU: 49
competitors_look_for__fun 8319 0.01 0.31 FAL: 41, TRU: 18
competitors_look_for__ambitious 7693 0.08 0.18 FAL: 563, TRU: 122
competitors_look_for__shared_interests 8059 0.04 0.15 FAL: 271, TRU: 48
others_perceive_you__attractive 8378 0.00 NaN :
others_perceive_you__sincere 8368 0.00 1.00 TRU: 10
others_perceive_you__intelligent 8378 0.00 NaN :
others_perceive_you__fun 8378 0.00 NaN :
others_perceive_you__ambitious 8363 0.00 1.00 TRU: 15
actual_importance__attractive 8378 0.00 NaN :
actual_importance__sincere 8205 0.02 0.00 FAL: 173
actual_importance__intelligent 8297 0.01 0.00 FAL: 81
actual_importance__fun 8344 0.00 0.00 FAL: 34
actual_importance__ambitious 7842 0.06 0.00 FAL: 536
actual_importance__shared_interests 8165 0.03 0.00 FAL: 213

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
unique_id_number 0 1.00 283.68 158.58 1.00 154.00 281.00 407.00 552.00 ▇▇▇▇▇
id_within_wave 1 1.00 8.96 5.49 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1.00 17.33 10.94 1.00 8.00 16.00 26.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1.00 16.87 4.36 5.00 14.00 18.00 20.00 22.00 ▁▃▂▅▇
position_meeting 0 1.00 9.04 5.51 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
order_meeting 0 1.00 8.93 5.48 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1.00 8.96 5.49 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 10 1.00 283.86 158.58 1.00 154.00 281.00 408.00 552.00 ▇▇▇▇▇
interests_correlation 158 0.98 0.20 0.30 -0.83 -0.02 0.21 0.43 0.91 ▁▃▇▇▂
my_age 95 0.99 26.36 3.57 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 104 0.99 26.36 3.56 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 89 0.99 0.22 0.13 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 89 0.99 0.17 0.07 0.00 0.15 0.18 0.20 0.60 ▃▇▂▁▁
partner_stated_pref_time0_intelligent 89 0.99 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
partner_stated_pref_time0_fun 98 0.99 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 107 0.99 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
partner_stated_pref_time0_shared_interests 129 0.98 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
importance_same_race 79 0.99 3.78 2.85 0.00 1.00 3.00 6.00 10.00 ▇▃▂▂▂
importance_same_religion 79 0.99 3.65 2.81 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▁
income_zipcode 4099 0.51 44887.61 17206.92 8607.00 31516.00 43185.00 54303.00 109031.00 ▃▇▅▂▁
happy_expec 101 0.99 5.53 1.73 1.00 5.00 6.00 7.00 10.00 ▁▃▇▃▁
n_expect_like_you 6578 0.21 5.57 4.76 0.00 2.00 4.00 8.00 20.00 ▇▃▂▁▁
i_found_partner__attractive 202 0.98 6.19 1.95 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 277 0.97 7.18 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 296 0.96 7.37 1.55 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 350 0.96 6.40 1.95 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▂
i_found_partner__ambitious 712 0.92 6.78 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
i_found_partner__interests 1067 0.87 5.47 2.16 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 212 0.97 6.19 1.95 0.00 5.00 6.00 8.00 10.50 ▁▃▇▇▂
partner_found_me__sincere 287 0.97 7.18 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 306 0.96 7.37 1.55 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 360 0.96 6.40 1.95 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 722 0.91 6.78 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 1076 0.87 5.47 2.16 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 309 0.96 5.21 2.13 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 79 0.99 0.23 0.13 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
you_look_for__sincere 79 0.99 0.17 0.07 0.00 0.15 0.18 0.20 0.60 ▃▇▂▁▁
you_look_for__intelligent 79 0.99 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
you_look_for__fun 89 0.99 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 99 0.99 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
you_look_for__shared_interests 121 0.99 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
you_perceive_yourself__attractive 105 0.99 7.08 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 105 0.99 8.29 1.41 2.00 8.00 8.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 105 0.99 7.70 1.56 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 105 0.99 8.40 1.08 3.00 8.00 8.00 9.00 10.00 ▁▁▃▆▇
you_perceive_yourself__ambitious 105 0.99 7.58 1.78 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▆

A good way to see a summary of the data

The skim() function returns a tibble, which can be used to extract statistics about the dataset.

resumo <- skim(dados_speed_date_normalizada)


glimpse(resumo)
## Rows: 79
## Columns: 17
## $ skim_type         <chr> "factor", "factor", "factor", "factor", "factor",...
## $ skim_variable     <chr> "sex", "choice", "partner_race", "frequency_date"...
## $ n_missing         <int> 0, 0, 73, 97, 2778, 138, 915, 4107, 63, 79, 0, 79...
## $ complete_rate     <dbl> 1.000000000, 1.000000000, 0.991286703, 0.98842205...
## $ factor.ordered    <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
## $ factor.n_unique   <int> 2, 2, 5, 7, 2, 17, 3, 2, 5, 6, NA, NA, NA, NA, NA...
## $ factor.top_counts <chr> "Hom: 4194, Mul: 4184", "ext: 6944, lim: 1434", "...
## $ logical.mean      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.1647171...
## $ logical.count     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "FAL: 699...
## $ numeric.mean      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.sd        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p0        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p25       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p50       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p75       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p100      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.hist      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

We can see that many fields are nearly complete, while some are much less filled in.

Generally, these are fields that were filled in via a survey conducted weeks after the event.

ggplot(resumo) +
  geom_density(
    aes(
      x = complete_rate
    ),
    adjust = 0.1
  ) +
  theme_minimal()

We then remove the fields with little representation.

campos_com_representacao <-  resumo %>% 
  filter(
    complete_rate > 0.75
  )


dados_com_representacao <- dados_speed_date_normalizada %>% 
  select(
    all_of(campos_com_representacao$skim_variable)
  )

Adding the partner's data

We also want some of the partner's impressions in our dataset, so we build the final summary to start playing with the data.

dados_speed_date_partner_side <- dados_speed_date_normalizada %>% 
  select(
    unique_id_number, 
    partner_unique_id_number,
    probability_partner_find_i_liked_partner = probability_i_find_partner_liked_me,
    partner_career_macro = career_macro,
    starts_with("you_perceive_yourself__")
  ) %>% 
  rename_with(
    .cols = starts_with("you_perceive_yourself__"),
    .fn = ~str_replace(.x, "you_perceive_yourself__", "partner_perceives_himself__")
  )


dados_finais <- dados_com_representacao %>% 
  left_join(
    dados_speed_date_partner_side,
    by = c("unique_id_number" = "partner_unique_id_number", "partner_unique_id_number" = "unique_id_number"  )
  ) %>% 
  filter(
    if_all(
      .cols = everything(),
      .fns = ~ !is.na(.x)
    )
  ) %>% 
  mutate(
    across(
      .cols = where(is.logical) ,
      .fns = as.numeric
    )
  )
  



resumo_com_representacao <-  skim(dados_finais)

resumo_com_representacao
Data summary
Name dados_finais
Number of rows 4885
Number of columns 64
_______________________
Column type frequency:
factor 9
numeric 55
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 2456, Mul: 2429
choice 0 1 FALSE 2 ext: 4099, lim: 786
partner_race 0 1 FALSE 5 Whi: 2683, Asi: 1210, Lat: 386, Oth: 357
frequency_date 0 1 TRUE 7 Twi: 1301, Sev: 1242, Onc: 910, Nev: 793
career_macro 0 1 FALSE 17 Aca: 1472, Ban: 1236, Cre: 419, Law: 401
opinion_duration_of_date 0 1 FALSE 3 Too: 2786, Jus: 1996, Too: 103
race 0 1 FALSE 5 Whi: 2687, Asi: 1190, Lat: 406, Oth: 372
goal 0 1 FALSE 6 Fun: 2054, Mee: 1793, Dat: 372, To : 273
partner_career_macro 0 1 FALSE 17 Aca: 1396, Ban: 1292, Cre: 408, Law: 379

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.18 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.39 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
i_liked_partner 0 1 0.45 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 283.07 156.88 4.00 160.00 274.00 411.00 552.00 ▇▆▇▆▇
id_within_wave 0 1 9.10 5.57 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.61 11.09 1.00 8.00 16.00 26.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1 17.01 4.33 5.00 14.00 18.00 20.00 22.00 ▁▂▂▅▇
position_meeting 0 1 9.13 5.50 1.00 4.00 9.00 13.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.87 5.46 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1 9.12 5.51 1.00 5.00 9.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 0 1 282.91 156.98 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.20 0.30 -0.83 -0.02 0.22 0.43 0.91 ▁▃▇▇▂
my_age 0 1 26.16 3.44 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 0 1 26.19 3.41 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▂▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.24 0.50 ▁▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
importance_same_race 0 1 3.83 2.83 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▂
importance_same_religion 0 1 3.61 2.85 1.00 1.00 3.00 6.00 10.00 ▇▂▃▂▂
happy_expec 0 1 5.49 1.78 1.00 5.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.25 1.94 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 0 1 7.22 1.72 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.43 1.52 0.00 7.00 8.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.48 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
i_found_partner__ambitious 0 1 6.82 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.55 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.21 1.93 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.17 1.74 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.39 1.53 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.43 1.94 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 0 1 6.76 1.79 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.50 2.13 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 0 1 5.33 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.23 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▆▇▇▃▁
you_perceive_yourself__attractive 0 1 7.14 1.41 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.44 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.77 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 0 1 8.50 1.08 3.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
you_perceive_yourself__ambitious 0 1 7.65 1.83 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
partner_perceives_himself__attractive 0 1 7.11 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
partner_perceives_himself__sincere 0 1 8.33 1.44 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.73 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.08 3.00 8.00 8.00 9.00 10.00 ▁▁▂▆▇
partner_perceives_himself__ambitious 0 1 7.63 1.82 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇

Some exploratory analysis

We can check, for example, whether people have an accurate picture of their own attractiveness.

escala_sexo <- c(Homem = "darkblue", Mulher = "darkred")


dados_finais %>% 
  ggplot(
    aes(
      y = partner_found_me__attractive,
      x = you_perceive_yourself__attractive
    )
  ) +
  geom_boxplot(
    aes(
      group = you_perceive_yourself__attractive,
      color = sex,
      fill = sex
    ),
    alpha = 0.3,
    show.legend = FALSE
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_fill_manual(
    values = escala_sexo
  ) +
  stat_smooth(
    method = "loess",
    formula = y ~ x,
    show.legend = FALSE,
    se = FALSE,
    aes(
      color = sex
    )
    
  ) +
  geom_function(
    fun = identity 
  ) +
  facet_wrap(
    ~sex
  ) +
  scale_x_continuous(
    breaks = 0:10
  ) +
  scale_y_continuous(
    breaks = 0:10
  ) +
  labs(
    x = "Me acho bonito",
    y = "Parceiro me acha bonito"
  ) +
  theme_minimal()

Some exploratory analysis

How does how highly I rated my partner on each attribute correlate with whether I liked the partner?

dados_grafico_partner_liked <- dados_finais %>% 
  select(
    i_liked_partner,
    starts_with("i_found_partner__"),
    sex
  ) %>% 
  pivot_longer(
    cols = -c(i_liked_partner, sex),
    names_to = "i_found_partner",
    names_pattern = "i_found_partner__(.*)",
    values_to = "degree"
  ) %>% 
  mutate(
    degree = round(degree)
  ) %>% 
  group_by(
    degree,
    i_found_partner,
    sex
  ) %>% 
  summarise(
    i_liked_partner = mean(i_liked_partner),
    n = n()
  ) %>% 
  filter(
    n > 100
  )

  
ggplot(dados_grafico_partner_liked) +
  geom_line(
    aes(
      x = degree,
      y = i_liked_partner,
      color = sex
    ),
    size = 1.2

  ) +
  geom_point(
    aes(
      x = degree,
      y = i_liked_partner,
      color = sex,
      size = n
    )
  ) +
  facet_wrap(
    ~i_found_partner
  ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  scale_x_continuous(
    breaks = 1:10
  ) +
  scale_y_continuous(
    limits = c(0,1),
    breaks = seq(0, to = 1, by = .2),
    labels = percent_format(accuracy = 1)
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  labs(
    x = "Gostei deste atributo no parceiro",
    y = "Gostei do parceiro. Quero ele(a)"
  )

The parsnip library

In the previous analysis we computed the conditional mean one variable at a time, but we can also condition on all the variables at once.

The way to do that is to fit a multiple ordinary least squares linear regression.

parsnip is the successor to the core of caret.

It provides a uniform interface to several types of statistical learning models.

Here we choose a linear model and use R's lm function as the engine.

lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

lm_mod
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
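Because the model specification is decoupled from the computational engine, swapping the engine is a one-line change. A minimal sketch (the "glm" engine ships with parsnip; engines like "glmnet" or "stan" require additional packages):

```r
library(dplyr)    # for %>%
library(parsnip)

# Same linear model specification, different computational engine
lm_mod_glm <- 
  linear_reg() %>% 
  set_engine("glm")
```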

Multiple regression on the attributes

Now we actually fit the model.

Note that the model is fit with interactions between the attributes and the “sex” dummy.

The broom library offers methods, such as tidy(), for extracting estimates and statistics from the objects returned by parsnip's training functions, like fit().

lm_fit <- 
  lm_mod %>% 
  fit(  i_liked_partner ~ 
        sex +
        i_found_partner__attractive * sex +
        i_found_partner__ambitious * sex +
        i_found_partner__fun * sex +
        i_found_partner__intelligent * sex +
        i_found_partner__interests * sex +
        i_found_partner__sincere * sex ,

  data = dados_finais)


tidy(lm_fit)
## # A tibble: 14 x 5
##    term                                   estimate std.error statistic  p.value
##    <chr>                                     <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                           -0.395      0.0442    -8.92   6.56e-19
##  2 sexHomem                               0.0674     0.0643     1.05   2.95e- 1
##  3 i_found_partner__attractive            0.0673     0.00547   12.3    2.77e-34
##  4 i_found_partner__ambitious            -0.0205     0.00633   -3.25   1.18e- 3
##  5 i_found_partner__fun                   0.0365     0.00629    5.81   6.71e- 9
##  6 i_found_partner__intelligent           0.0189     0.00846    2.23   2.59e- 2
##  7 i_found_partner__interests             0.0479     0.00514    9.31   1.98e-20
##  8 i_found_partner__sincere              -0.0164     0.00648   -2.53   1.13e- 2
##  9 sexHomem:i_found_partner__attractive   0.0472     0.00786    6.00   2.07e- 9
## 10 sexHomem:i_found_partner__ambitious   -0.00375    0.00918   -0.409  6.83e- 1
## 11 sexHomem:i_found_partner__fun         -0.000680   0.00921   -0.0738 9.41e- 1
## 12 sexHomem:i_found_partner__intelligent -0.0312     0.0122    -2.56   1.05e- 2
## 13 sexHomem:i_found_partner__interests   -0.00185    0.00732   -0.252  8.01e- 1
## 14 sexHomem:i_found_partner__sincere     -0.00353    0.00973   -0.363  7.17e- 1

It is easier to see this in graphical form.

dwplot(tidy(lm_fit), dot_args = list(size = 2, color = "darkblue"),
         whisker_args = list(color = "darkblue"),
         vline = geom_vline(xintercept = 0, colour = "darkblue", linetype = 2)) +
  theme_minimal()

Estimating the response for new data

We can use the predict() function to generate estimates of y for new values of x.

medias_i_found <- dados_finais %>% 
  select(
    starts_with("i_found_partner__"),
    sex
  ) %>% 
  pivot_longer(
    cols = -c(sex),
    names_to = "i_found_partner",
    names_pattern = "i_found_partner__(.*)",
    values_to = "degree"
  ) %>% 
  mutate(
    degree = as.numeric(degree)
  ) %>% 
  group_by(
    sex,
    i_found_partner
  ) %>% 
  summarise(
    p10 = quantile(degree, probs = 0.1, na.rm = TRUE),
    p90 = quantile(degree, probs = 0.9, na.rm = TRUE),
    p25 = quantile(degree, probs = 0.25, na.rm = TRUE),
    p75 = quantile(degree, probs = 0.75, na.rm = TRUE),
    p33 = quantile(degree, probs = 0.33, na.rm = TRUE),
    p67 = quantile(degree, probs = 0.67, na.rm = TRUE),
    mean = mean(degree, na.rm = TRUE)
  ) %>% 
  pivot_wider(
    names_from = i_found_partner,
    values_from = c(mean, p10, p90, p25, p75, p33, p67)
  ) 
  

med_h <- medias_i_found %>% 
  filter(
    sex == "Homem"
  )
  
med_m <- medias_i_found %>% 
  filter(
    sex == "Mulher"
  )

pontos_novos <- 
  tribble(
    ~attractive,           ~ambitious,     ~fun,      ~intelligent,      ~interests,     ~sincere,       ~sex,    ~nome,
    med_h$mean_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "Média",
    med_h$p10_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P10",
    med_h$p25_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P25",
    med_h$p90_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem",  "P90",
    med_h$p75_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P75",
    med_h$p33_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P33",
    med_h$p67_attractive,  med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P67",
    med_m$mean_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "Média",
    med_m$p10_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P10",
    med_m$p90_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P90",
    med_m$p25_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P25",
    med_m$p75_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P75",
    med_m$p33_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P33",
    med_m$p67_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P67" 
  ) %>% 
  rename_with(
    .cols = -c(sex, nome),
    .fn = ~str_glue("i_found_partner__{.x}")
  ) 
  



conf_int_pred <- predict(lm_fit, 
                         new_data = pontos_novos, 
                         type = "conf_int")

mean_pred <- predict(lm_fit, 
                         new_data = pontos_novos
                         )


dados_pred <- pontos_novos %>% 
  bind_cols(
    conf_int_pred
  ) %>% 
  bind_cols(
    mean_pred
  ) %>% 
  view()


ggplot(dados_pred, aes(x = i_found_partner__attractive)) + 
  geom_point(aes(y = .pred, color = sex)) + 
  geom_errorbar(aes(ymin = .pred_lower, 
                    ymax = .pred_upper, color = sex),
                width = .2) + 
  labs(y = "Prob. I like partner")+
  # geom_mark_circle(
  #   aes(
  #     y = .pred,
  #     label = nome,
  #     group = interaction(sex, nome),
  #     color = sex,
  #     fill = sex
  #   ),
  #   label.fontsize = 7,
  #   con.cap = 1,
  #   expand = 0.001,
  #   label.buffer = unit(1, 'mm'),
  #   show.legend = FALSE
  # ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  geom_line(
    aes(
      color = sex,
      y = .pred
    )
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_x_continuous(
    breaks = 1:10
  ) +
  scale_y_continuous(
    breaks = seq(0, to = 1, by= 0.2),
    limits = c(0,1),
    label = percent_format(accuracy = 1)
  )

Fitting a more complex model

Now we move beyond the linear model and fit a neural network.

dados_finais_nao_nulos_sex_numerico <- dados_finais %>% 
  mutate(
    sex = if_else(sex == "Homem", 1, 0) ,
    i_liked_partner = as.numeric(i_liked_partner),
  ) %>% 
  filter(
    across(
      .cols = everything(),
      .fns = ~!is.na(.x)
    )
  )

pontos_novos_rand_for <- pontos_novos %>% 
  mutate(
    sex = if_else(sex == "Homem", 1, 0) 
  ) 
  

set.seed(192)

modelo_nnet <- mlp(mode = "regression", hidden_units = 10 ) %>%
  set_engine("nnet") 
  
modelo_nnet
## Single Layer Neural Network Specification (regression)
## 
## Main Arguments:
##   hidden_units = 10
## 
## Computational engine: nnet
fit_nnet <- modelo_nnet %>% fit(  i_liked_partner ~ 
        i_found_partner__attractive +
        i_found_partner__ambitious +
        i_found_partner__fun +
        i_found_partner__intelligent +
        i_found_partner__interests +
        i_found_partner__sincere +
        sex,
        

  data =  dados_finais_nao_nulos_sex_numerico)

fit_nnet
## parsnip model object
## 
## Fit time:  940ms 
## a 7-10-1 network with 91 weights
## inputs: i_found_partner__attractive i_found_partner__ambitious i_found_partner__fun i_found_partner__intelligent i_found_partner__interests i_found_partner__sincere sex 
## output(s): i_liked_partner 
## options were - linear output units

Relationship between input and output variables in the neural network

With a neural network, the relationships do not need to be linear. That is the case here.

mean_pred <- predict(fit_nnet, 
                         new_data = pontos_novos_rand_for 
                         )





dados_pred_nnet <- pontos_novos %>% 
  bind_cols(
    mean_pred
  )


ggplot(dados_pred_nnet, aes(x = i_found_partner__attractive)) + 
  geom_point(aes(y = .pred, color = sex)) + 
  geom_mark_circle(
    aes(
      y = .pred,
      label = nome,
      group = interaction(nome, sex),
      color = sex,
      fill = sex
    ),
    label.fontsize = 8,
    con.cap = 1,
    expand = 0.001,
    label.buffer = unit(3.5, 'mm'),
    show.legend = FALSE
  ) +
  theme_minimal() +
  theme(
    legend.position = "top"
  ) +
  geom_line(
    aes(
      color = sex,
      y = .pred
    )
  ) +
  scale_color_manual(
    values = escala_sexo
  ) +
  scale_y_continuous(
    breaks = seq(0, to = 1, by= 0.2),
    limits = c(0,1)
  ) +
  labs(y = "Prob. I like partner")

Exploratory analysis can (and should) go much deeper

It is important to have an exploration session, which can be much more detailed than the one we did.

The exploration session helps us run some sanity checks on the data and extract insights that may or may not be used in the feature engineering process, which can help the model achieve better results.

Most of the time, feature engineering is where we can most improve the results of whatever model we choose to use.

Very complex deep learning models depend less on this process, but they require a colossal amount of data.

Model selection and evaluation: first set aside the test data

Everything we do during model selection, as we have seen, must be done on the training data (which will also serve for validation).

After choosing ONE model, we evaluate it on the test data.

Source: Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn and Johnson)

The rsample library provides the infrastructure needed to draw samples from the available data.

We use it here to set aside the test data.

It will be used again to create the resamples used in cross-validation.

set.seed() is used for reproducibility. With the same seed, we guarantee that every run of the script generates the same (pseudo)random sequence.

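This guarantee is easy to check in base R: with the same seed, runif() reproduces exactly the same draws (a minimal sketch, independent of the speed-dating data):

```r
# Same seed => identical pseudo-random sequence
set.seed(123)
a <- runif(5)

set.seed(123)
b <- runif(5)

identical(a, b)  # TRUE: the two sequences match exactly

# A different seed gives a different sequence
set.seed(456)
d <- runif(5)
identical(a, d)  # FALSE
```
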
The strata parameter guarantees that the balance of one of the attributes (here, the one we will use as the outcome) is preserved in both partitions.

dados_classificacao <- dados_finais %>% 
  mutate(
    i_liked_partner = if_else(i_liked_partner == 1, "Liked", "Not") %>% factor(levels = c("Liked","Not"))
  )  
  

set.seed(123)
# Put 3/4 of the data into the training set 
split_dado <- initial_split(
  data = dados_classificacao, 
  strata = i_liked_partner,
  prop = 3/4
)


# Create data frames for the two sets:
dado_treino <- training(split_dado)
dado_teste  <- testing(split_dado)
skim(dado_treino)
Data summary
Name dado_treino
Number of rows 3664
Number of columns 64
_______________________
Column type frequency:
factor 10
numeric 54
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 1845, Mul: 1819
choice 0 1 FALSE 2 ext: 3076, lim: 588
partner_race 0 1 FALSE 5 Whi: 2015, Asi: 919, Lat: 270, Oth: 267
frequency_date 0 1 TRUE 7 Twi: 968, Sev: 920, Onc: 706, Nev: 602
career_macro 0 1 FALSE 17 Aca: 1088, Ban: 932, Cre: 320, Law: 285
opinion_duration_of_date 0 1 FALSE 3 Too: 2116, Jus: 1473, Too: 75
race 0 1 FALSE 5 Whi: 2049, Asi: 895, Lat: 302, Oth: 258
goal 0 1 FALSE 6 Fun: 1528, Mee: 1351, Dat: 281, To : 215
i_liked_partner 0 1 FALSE 2 Not: 2017, Lik: 1647
partner_career_macro 0 1 FALSE 17 Aca: 1055, Ban: 970, Cre: 303, Law: 283

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.18 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.40 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.43 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 283.23 157.01 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
id_within_wave 0 1 9.14 5.60 1.00 4.00 8.00 14.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.69 11.17 1.00 8.00 16.00 27.00 44.00 ▇▇▅▅▂
n_people_met_in_wave 0 1 17.02 4.35 5.00 14.00 18.00 20.00 22.00 ▁▂▂▃▇
position_meeting 0 1 9.09 5.48 1.00 4.00 8.00 13.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.85 5.45 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partnet_id_within_wave 0 1 9.23 5.49 1.00 5.00 9.00 14.00 22.00 ▇▆▅▅▂
partner_unique_id_number 0 1 283.12 156.95 4.00 158.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.19 0.30 -0.83 -0.02 0.21 0.43 0.91 ▁▃▇▇▂
my_age 0 1 26.14 3.45 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_age 0 1 26.19 3.35 18.00 24.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 1.00 ▇▃▁▁▁
partner_stated_pref_time0_sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.21 0.07 0.00 0.18 0.20 0.25 0.50 ▁▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
importance_same_race 0 1 3.87 2.85 1.00 1.00 3.00 6.00 10.00 ▇▃▃▂▂
importance_same_religion 0 1 3.60 2.83 1.00 1.00 3.00 6.00 10.00 ▇▂▃▂▁
happy_expec 0 1 5.52 1.77 1.00 5.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.23 1.93 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
i_found_partner__sincere 0 1 7.19 1.73 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.42 1.51 0.00 7.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.46 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
i_found_partner__ambitious 0 1 6.82 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.54 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.20 1.95 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.18 1.73 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.39 1.54 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.42 1.93 0.00 5.00 7.00 8.00 11.00 ▁▂▇▇▂
partner_found_me__ambitious 0 1 6.75 1.80 0.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.51 2.14 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
probability_i_find_partner_liked_me 0 1 5.33 2.12 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.18 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.07 0.00 0.18 0.20 0.24 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.10 0.11 0.16 0.30 ▅▇▇▃▁
you_perceive_yourself__attractive 0 1 7.13 1.42 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.43 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.75 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
you_perceive_yourself__intelligent 0 1 8.50 1.07 3.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
you_perceive_yourself__ambitious 0 1 7.64 1.84 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.14 0.00 4.00 5.00 7.00 10.00 ▂▅▇▅▁
partner_perceives_himself__attractive 0 1 7.10 1.40 2.00 6.00 7.00 8.00 10.00 ▁▂▂▇▂
partner_perceives_himself__sincere 0 1 8.35 1.42 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.72 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.09 3.00 8.00 8.00 9.00 10.00 ▁▁▂▆▇
partner_perceives_himself__ambitious 0 1 7.65 1.80 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▆

skim(dado_teste)
Data summary
Name dado_teste
Number of rows 1221
Number of columns 64
_______________________
Column type frequency:
factor 10
numeric 54
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 Hom: 611, Mul: 610
choice 0 1 FALSE 2 ext: 1023, lim: 198
partner_race 0 1 FALSE 5 Whi: 668, Asi: 291, Lat: 116, Oth: 90
frequency_date 0 1 TRUE 7 Twi: 333, Sev: 322, Onc: 204, Nev: 191
career_macro 0 1 FALSE 16 Aca: 384, Ban: 304, Law: 116, Cre: 99
opinion_duration_of_date 0 1 FALSE 3 Too: 670, Jus: 523, Too: 28
race 0 1 FALSE 5 Whi: 638, Asi: 295, Oth: 114, Lat: 104
goal 0 1 FALSE 6 Fun: 526, Mee: 442, Dat: 91, Oth: 58
i_liked_partner 0 1 FALSE 2 Not: 672, Lik: 549
partner_career_macro 0 1 FALSE 17 Aca: 341, Ban: 322, Cre: 105, Law: 96

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
match 0 1 0.17 0.38 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
same_race 0 1 0.37 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
partner_liked_me 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
met_before 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
unique_id_number 0 1 282.60 156.57 4.00 160.00 274.00 408.00 552.00 ▆▆▇▆▇
id_within_wave 0 1 8.98 5.46 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
subject_within_gender 0 1 17.37 10.87 1.00 8.00 16.00 26.00 44.00 ▇▇▆▅▂
n_people_met_in_wave 0 1 16.97 4.27 5.00 15.00 18.00 20.00 22.00 ▁▃▂▅▇
position_meeting 0 1 9.23 5.53 1.00 4.00 9.00 14.00 22.00 ▇▆▅▅▂
order_meeting 0 1 8.95 5.47 1.00 4.00 8.00 13.00 22.00 ▇▆▅▅▂
partnet_id_within_wave 0 1 8.81 5.57 1.00 4.00 8.00 13.00 22.00 ▇▆▅▃▂
partner_unique_id_number 0 1 282.29 157.14 4.00 156.00 274.00 411.00 552.00 ▇▆▇▆▇
interests_correlation 0 1 0.21 0.31 -0.83 -0.01 0.23 0.44 0.91 ▁▃▇▇▂
my_age 0 1 26.19 3.39 18.00 24.00 26.00 28.00 42.00 ▂▇▅▁▁
partner_age 0 1 26.18 3.58 18.00 23.00 26.00 28.00 55.00 ▇▇▁▁▁
partner_stated_pref_time0_attractive 0 1 0.23 0.12 0.00 0.15 0.20 0.25 0.90 ▆▇▂▁▁
partner_stated_pref_time0_sincere 0 1 0.17 0.07 0.00 0.15 0.18 0.20 0.47 ▂▇▇▁▁
partner_stated_pref_time0_intelligent 0 1 0.20 0.07 0.00 0.17 0.20 0.23 0.50 ▂▇▃▁▁
partner_stated_pref_time0_fun 0 1 0.18 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
partner_stated_pref_time0_ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▇▇▇▁▁
partner_stated_pref_time0_shared_interests 0 1 0.12 0.06 0.00 0.08 0.11 0.16 0.30 ▆▇▇▅▁
importance_same_race 0 1 3.68 2.79 1.00 1.00 3.00 6.00 10.00 ▇▃▂▂▁
importance_same_religion 0 1 3.66 2.90 1.00 1.00 3.00 6.00 10.00 ▇▃▂▂▂
happy_expec 0 1 5.43 1.82 1.00 4.00 6.00 7.00 10.00 ▁▃▇▅▁
i_found_partner__attractive 0 1 6.29 1.97 0.00 5.00 6.00 8.00 10.00 ▁▃▇▇▃
i_found_partner__sincere 0 1 7.29 1.71 0.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__intelligent 0 1 7.46 1.55 0.00 7.00 8.00 8.00 10.00 ▁▁▃▇▃
i_found_partner__fun 0 1 6.54 1.94 0.00 5.00 7.00 8.00 10.00 ▁▂▆▇▃
i_found_partner__ambitious 0 1 6.83 1.79 0.00 6.00 7.00 8.00 10.00 ▁▁▆▇▃
i_found_partner__interests 0 1 5.57 2.15 0.00 4.00 6.00 7.00 10.00 ▂▅▇▆▂
partner_found_me__attractive 0 1 6.23 1.89 1.00 5.00 6.00 8.00 10.00 ▁▃▇▇▂
partner_found_me__sincere 0 1 7.13 1.74 1.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__intelligent 0 1 7.36 1.50 1.00 6.00 7.00 8.00 10.00 ▁▁▃▇▃
partner_found_me__fun 0 1 6.45 1.97 0.00 5.00 7.00 8.00 10.00 ▁▂▇▇▃
partner_found_me__ambitious 0 1 6.80 1.76 1.00 6.00 7.00 8.00 10.00 ▁▂▆▇▃
partner_found_me__interests 0 1 5.48 2.11 0.00 4.00 6.00 7.00 10.00 ▂▅▇▅▂
probability_i_find_partner_liked_me 0 1 5.36 2.12 0.00 4.00 5.00 7.00 10.00 ▂▃▇▅▁
you_look_for__attractive 0 1 0.22 0.11 0.00 0.15 0.20 0.25 0.90 ▇▇▁▁▁
you_look_for__sincere 0 1 0.18 0.07 0.00 0.15 0.19 0.20 0.47 ▁▇▇▁▁
you_look_for__intelligent 0 1 0.20 0.06 0.00 0.18 0.20 0.23 0.50 ▁▇▃▁▁
you_look_for__fun 0 1 0.17 0.06 0.00 0.15 0.18 0.20 0.50 ▂▇▂▁▁
you_look_for__ambitious 0 1 0.11 0.06 0.00 0.05 0.10 0.15 0.36 ▆▇▇▁▁
you_look_for__shared_interests 0 1 0.12 0.06 0.00 0.08 0.11 0.16 0.30 ▆▇▆▅▁
you_perceive_yourself__attractive 0 1 7.18 1.38 2.00 7.00 7.00 8.00 10.00 ▁▂▂▇▂
you_perceive_yourself__sincere 0 1 8.33 1.45 2.00 8.00 9.00 9.00 10.00 ▁▁▁▆▇
you_perceive_yourself__fun 0 1 7.83 1.58 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▇
you_perceive_yourself__intelligent 0 1 8.49 1.08 3.00 8.00 9.00 9.00 10.00 ▁▁▂▆▇
you_perceive_yourself__ambitious 0 1 7.68 1.80 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇
probability_partner_find_i_liked_partner 0 1 5.27 2.05 0.00 4.00 5.00 7.00 10.00 ▂▃▇▅▁
partner_perceives_himself__attractive 0 1 7.15 1.39 2.00 6.00 7.00 8.00 10.00 ▁▁▂▇▂
partner_perceives_himself__sincere 0 1 8.27 1.49 2.00 7.00 9.00 9.00 10.00 ▁▁▁▆▇
partner_perceives_himself__fun 0 1 7.75 1.57 2.00 7.00 8.00 9.00 10.00 ▁▁▂▇▆
partner_perceives_himself__intelligent 0 1 8.46 1.06 4.00 8.00 9.00 9.00 10.00 ▁▁▂▅▇
partner_perceives_himself__ambitious 0 1 7.59 1.89 2.00 7.00 8.00 9.00 10.00 ▁▂▂▇▇

Specification and preprocessing, for any model

As we mentioned, the tidymodels libraries are orthogonal.

The recipes library serves one specific purpose: defining the data preprocessing steps. These steps can be defined independently of the model specification, the cross-validation scheme, the performance measurement, and so on.

One of the steps that can be defined with recipes is flagging attributes that play no predictive role and therefore should not be used in training or prediction, but that we want to keep in our tibble to identify the rows.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  )

summary(receita)
## # A tibble: 64 x 4
##    variable                 type    role      source  
##    <chr>                    <chr>   <chr>     <chr>   
##  1 sex                      nominal predictor original
##  2 choice                   nominal predictor original
##  3 partner_race             nominal predictor original
##  4 frequency_date           nominal predictor original
##  5 career_macro             nominal predictor original
##  6 opinion_duration_of_date nominal predictor original
##  7 race                     nominal predictor original
##  8 goal                     nominal predictor original
##  9 match                    numeric ID        original
## 10 same_race                numeric predictor original
## # ... with 54 more rows

Creating dummies

Some model engines do not handle factors well.

Anyone used to lm knows that factors are automatically turned into dummies, but that does not happen with every engine.

step_dummy() does this job: it creates a binary variable for each level of the factor (minus one). Note how we can use the all_nominal() and all_outcomes() selectors.

step_zv() removes variables with zero variance. This happens often when factor levels are infrequent.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_dummy(
    all_nominal(), -all_outcomes() 
  ) %>% 
  step_zv(all_predictors()) 
  


summary(receita) %>% 
  gt()
variable type role source
sex nominal predictor original
choice nominal predictor original
partner_race nominal predictor original
frequency_date nominal predictor original
career_macro nominal predictor original
opinion_duration_of_date nominal predictor original
race nominal predictor original
goal nominal predictor original
match numeric ID original
same_race numeric predictor original
partner_liked_me numeric predictor original
met_before numeric predictor original
unique_id_number numeric ID original
id_within_wave numeric ID original
subject_within_gender numeric ID original
n_people_met_in_wave numeric predictor original
position_meeting numeric predictor original
order_meeting numeric predictor original
partnet_id_within_wave numeric ID original
partner_unique_id_number numeric ID original
interests_correlation numeric predictor original
my_age numeric predictor original
partner_age numeric predictor original
partner_stated_pref_time0_attractive numeric predictor original
partner_stated_pref_time0_sincere numeric predictor original
partner_stated_pref_time0_intelligent numeric predictor original
partner_stated_pref_time0_fun numeric predictor original
partner_stated_pref_time0_ambitious numeric predictor original
partner_stated_pref_time0_shared_interests numeric predictor original
importance_same_race numeric predictor original
importance_same_religion numeric predictor original
happy_expec numeric predictor original
i_found_partner__attractive numeric predictor original
i_found_partner__sincere numeric predictor original
i_found_partner__intelligent numeric predictor original
i_found_partner__fun numeric predictor original
i_found_partner__ambitious numeric predictor original
i_found_partner__interests numeric predictor original
partner_found_me__attractive numeric predictor original
partner_found_me__sincere numeric predictor original
partner_found_me__intelligent numeric predictor original
partner_found_me__fun numeric predictor original
partner_found_me__ambitious numeric predictor original
partner_found_me__interests numeric predictor original
probability_i_find_partner_liked_me numeric predictor original
you_look_for__attractive numeric predictor original
you_look_for__sincere numeric predictor original
you_look_for__intelligent numeric predictor original
you_look_for__fun numeric predictor original
you_look_for__ambitious numeric predictor original
you_look_for__shared_interests numeric predictor original
you_perceive_yourself__attractive numeric predictor original
you_perceive_yourself__sincere numeric predictor original
you_perceive_yourself__fun numeric predictor original
you_perceive_yourself__intelligent numeric predictor original
you_perceive_yourself__ambitious numeric predictor original
probability_partner_find_i_liked_partner numeric predictor original
partner_career_macro nominal predictor original
partner_perceives_himself__attractive numeric predictor original
partner_perceives_himself__sincere numeric predictor original
partner_perceives_himself__fun numeric predictor original
partner_perceives_himself__intelligent numeric predictor original
partner_perceives_himself__ambitious numeric predictor original
i_liked_partner nominal outcome original
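What step_dummy() produces can be mimicked in base R with model.matrix(), which applies the same one-binary-column-per-level (minus the baseline) encoding; the toy data below is hypothetical:

```r
# Three-level factor; the first level (alphabetically, "Asian") becomes the baseline
df <- data.frame(race = factor(c("Asian", "White", "Latino", "White")))

m <- model.matrix(~ race, data = df)

colnames(m)       # "(Intercept)" "raceLatino" "raceWhite": one dummy per non-baseline level
m[, "raceWhite"]  # 0 1 0 1: flags the rows where race == "White"
```
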

Creating ordinal variables

Some factors are ordinal. It can be a good idea to encode them as a single numeric variable that preserves the natural order of the levels.

In our example, the variable representing how often people go out on dates has levels that can be ordered.

receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes()
  ) %>%
  step_zv(all_predictors()) 
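The effect of step_ordinalscore() can be mimicked in base R by converting an ordered factor to its integer codes (the toy levels below are hypothetical, not the real levels of frequency_date):

```r
# An ordered factor records the natural ordering of its levels
freq <- factor(c("Never", "Once a week", "Twice a week", "Never"),
               levels = c("Never", "Once a week", "Twice a week"),
               ordered = TRUE)

# The integer codes respect that ordering: Never < Once a week < Twice a week
as.integer(freq)  # 1 2 3 1
```
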

Creating a workflow to fit a model

Knowing how to create a preprocessing recipe, and recalling how to create a generic model interface with parsnip and select an engine for it, we can create a small workflow to run this processing, using the workflows library.

lr_mod <- 
  logistic_reg() %>% 
  set_engine("glm") 

wf <- workflow() %>% 
  add_recipe(receita) %>% 
  add_model(lr_mod) 

wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_zv()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Fitting the model using the workflow

With the workflow created, we can fit it using the fit() function.

fit_teste <- 
  wf %>% 
  fit(
    data = dado_treino
  )

fit_teste %>%  tidy() %>% 
  select(
    term,
    estimate,
    p.value
  ) %>% 
  arrange(
    p.value
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = one_of("estimate"),
    decimals = 3
  ) %>% 
  fmt_number(
    columns = one_of("p.value"),
    decimals = 2
  ) 
term estimate p.value
i_found_partner__attractive −0.573 0.00
probability_i_find_partner_liked_me −0.308 0.00
i_found_partner__interests −0.222 0.00
i_found_partner__fun −0.279 0.00
you_look_for__attractive 5.085 0.00
you_perceive_yourself__fun 0.171 0.00
race_White 0.620 0.00
sex_Homem −0.536 0.00
partner_found_me__attractive 0.120 0.00
career_macro_Politics 2.790 0.00
you_perceive_yourself__sincere 0.133 0.00
same_race −0.376 0.00
importance_same_race 0.063 0.00
i_found_partner__ambitious 0.123 0.00
goal_Serious −0.984 0.00
career_macro_Pro.sports.Athletics −4.469 0.00
probability_partner_find_i_liked_partner −0.083 0.00
you_look_for__ambitious 4.241 0.00
partner_career_macro_Speech.Pathology 2.574 0.00
n_people_met_in_wave −0.054 0.00
i_found_partner__sincere 0.110 0.01
career_macro_Lawyer 0.515 0.01
partner_career_macro_Psychologist 0.729 0.01
partner_career_macro_Creative.Arts.Entertainment 0.443 0.01
career_macro_Other 1.721 0.03
goal_Other 0.623 0.03
frequency_date 0.077 0.03
importance_same_religion 0.040 0.03
partner_liked_me 0.229 0.04
partner_found_me__interests 0.060 0.04
partner_career_macro_Journalism −1.347 0.04
career_macro_Social.Work −0.596 0.04
my_age 0.029 0.05
partner_career_macro_Real.Estate 1.255 0.05
choice_limited −0.429 0.05
happy_expec −0.057 0.06
opinion_duration_of_date_Too.little 0.188 0.06
race_Latino 0.352 0.07
you_perceive_yourself__attractive 0.078 0.08
you_look_for__sincere 1.770 0.08
you_look_for__intelligent 1.716 0.09
partner_perceives_himself__intelligent 0.089 0.09
opinion_duration_of_date_Too.much −0.538 0.09
career_macro_Creative.Arts.Entertainment 0.312 0.09
partner_perceives_himself__fun −0.059 0.11
partner_career_macro_Lawyer −0.311 0.12
partner_race_Others −0.314 0.12
partner_career_macro_Pro.sports.Athletics −2.273 0.12
(Intercept) 2.235 0.13
partner_found_me__intelligent −0.061 0.17
goal_Meet.new.people −0.244 0.18
career_macro_Engineer 0.390 0.19
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.168 0.20
partner_stated_pref_time0_intelligent −1.271 0.20
partner_perceives_himself__ambitious 0.040 0.22
partner_career_macro_Doctor.Medicine −0.248 0.22
career_macro_Doctor.Medicine 0.231 0.24
partner_career_macro_Social.Work 0.354 0.26
partner_perceives_himself__sincere −0.041 0.26
position_meeting −0.010 0.26
you_look_for__fun −1.132 0.27
race_Black −0.278 0.27
partner_found_me__ambitious −0.037 0.27
career_macro_Journalism −0.690 0.27
you_perceive_yourself__ambitious −0.033 0.30
interests_correlation 0.150 0.33
partner_career_macro_Engineer 0.265 0.38
i_found_partner__intelligent −0.043 0.38
partner_perceives_himself__attractive 0.036 0.40
career_macro_Psychologist −0.223 0.40
career_macro_Speech.Pathology 0.986 0.42
partner_career_macro_Architecture −0.810 0.42
career_macro_Undecided −0.187 0.43
partner_found_me__sincere −0.027 0.46
partner_stated_pref_time0_sincere 0.732 0.48
career_macro_Architecture −0.753 0.50
partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.078 0.55
race_Others −0.121 0.56
partner_race_Black −0.124 0.57
career_macro_Real.Estate −0.987 0.60
partner_career_macro_Other −0.298 0.60
career_macro_International.Humanitarian.Affairs 0.094 0.65
partner_stated_pref_time0_ambitious 0.479 0.71
partner_career_macro_International.Humanitarian.Affairs 0.063 0.74
goal_To.say −0.084 0.75
goal_Fun 0.054 0.76
partner_stated_pref_time0_fun 0.274 0.79
partner_stated_pref_time0_attractive −0.134 0.87
partner_career_macro_Undecided 0.035 0.88
met_before 0.027 0.90
you_perceive_yourself__intelligent −0.006 0.91
partner_found_me__fun −0.004 0.91
partner_race_White −0.014 0.91
partner_race_Latino −0.015 0.94
order_meeting 0.000 0.97
partner_age 0.001 0.97
partner_career_macro_Politics −0.022 0.97
partner_stated_pref_time0_shared_interests NA NA
you_look_for__shared_interests NA NA

Adding interactions between predictors

An interesting kind of step that can be added to a recipe is the creation of new predictors representing interactions between the original predictors.

The helper functions are very useful here: they make it much easier to add several interaction terms at once.

Here we use regression instead of classification to make the coefficients easier to interpret.

dado_treino_regressao <- dado_treino %>% 
  mutate(
    i_liked_partner = if_else(i_liked_partner == "Liked", 1, 0)
  )


receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
  ) %>% 
  step_zv(all_predictors()) 

Normalizing the predictors

Some models work better with normalized predictors, for example those that apply penalties to the coefficients multiplying the predictors. With all predictors on a similar scale, the penalty is applied more fairly.
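
What step_center() and step_scale() do can be reproduced in base R: subtract the mean and divide by the standard deviation, so predictors with very different units end up on the same footing before the penalty is applied (the vectors below are invented):

```r
# Two predictors on wildly different scales (invented values)
age    <- c(22, 25, 30, 27, 28)
income <- c(40000, 85000, 120000, 60000, 95000)

# Center and scale, as step_center() + step_scale() would
normalize <- function(x) (x - mean(x)) / sd(x)

age_z    <- normalize(age)
income_z <- normalize(income)
# Both now have mean 0 and standard deviation 1, so a single
# penalty treats their coefficients fairly
```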

receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>% 
  step_interact(
    terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
  ) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

Lasso and ridge regressions

With these interaction terms added, we end up with many predictors, several of them strongly correlated.

Using highly correlated terms (not only in pairs) can lead to collinearity. The model may attribute an effect to one input variable or another depending on the sample used for training, and therefore ends up with higher variance.
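
This variance inflation is easy to see in a small simulation (the data-generating process below is invented): with two almost identical predictors, the coefficient that lm() assigns to each of them swings a lot from sample to sample, while the coefficient of an independent predictor stays stable:

```r
set.seed(42)
coefs <- replicate(200, {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.05)  # nearly collinear with x1
  x3 <- rnorm(50)                  # independent predictor
  y  <- x1 + x2 + x3 + rnorm(50)
  coef(lm(y ~ x1 + x2 + x3))[c("x1", "x3")]
})

# Standard deviation of each estimate across the 200 samples:
# the collinear predictor's coefficient is far less stable
sds <- apply(coefs, 1, sd)
sds
```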

One way to keep too many variables from being effectively used in the model is to apply a penalty that reduces the number of active coefficients and, as a consequence, the variance of the model. That is exactly what Ridge and Lasso regressions do.

These regressions, which penalize the number and magnitude of the coefficients linking the input variables to the response, trade a little extra bias for lower variance and (in the lasso case) better interpretability: they adapt less to the training set, but they do not vary as much depending on which sample of the population was chosen for training. In addition, with fewer coefficients "switched on" (in the lasso case), the model is easier to interpret.

The Elastic Net model combines the Ridge penalty with the Lasso penalty by modifying the loss function of the regression, which in ordinary least squares, the most common case, is (as the name says) the squared error:

\[RSS = \sum_{i = 1}^{n} ( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \]

For Ridge regression, the coefficients receive a quadratic penalty. This reduces the variance of the model but does not do much to reduce the number of nonzero coefficients:

\[Loss_{Ridge} = RSS + \lambda \sum_{j=1}^{p}\beta_j^2 \]

For Lasso regression, the coefficients are penalized by their absolute value. This reduces the variance of the model AND reduces the number of nonzero coefficients, which favors interpretability:

\[Loss_{Lasso} = RSS + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \]
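
The reason the lasso sets coefficients exactly to zero is the absolute-value penalty: for an orthonormal design its solution has a closed form, the soft-thresholding operator, which shrinks every coefficient toward zero and clips the small ones to exactly zero. A minimal base-R sketch:

```r
# Soft-thresholding: the closed-form lasso solution for an
# orthonormal design, applied to each least-squares coefficient
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

betas <- c(-2.0, -0.3, 0.1, 0.8, 3.5)
soft_threshold(betas, lambda = 0.5)
# Coefficients smaller than lambda in absolute value (-0.3, 0.1)
# become exactly 0; the others are shrunk by 0.5 toward zero
```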

Using the hyperparameters

The concept of an interface is important in software engineering.

It is always better to depend on interfaces than on implementations. parsnip works as an abstraction layer that offers a single interface and takes care of calling the different implementations.
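
The idea of depending on an interface rather than on an implementation can be sketched in a few lines of base R (the "engines" below are dummies, not real parsnip code): the caller only ever sees one function, and the implementation behind it can be swapped freely:

```r
# Two interchangeable "engines" hidden behind a single interface
engines <- list(
  lm_like     = function(x, y) cov(x, y) / var(x),  # least-squares slope
  median_like = function(x, y) median(y / x)        # robust slope
)

# The single entry point: callers depend only on this signature
fit_slope <- function(x, y, engine = "lm_like") {
  engines[[engine]](x, y)
}

x <- 1:10
y <- 2 * x
fit_slope(x, y, engine = "lm_like")      # 2
fit_slope(x, y, engine = "median_like")  # 2
```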

parsnip's generic interfaces are ready to receive the most common parameters used by the engines and to pass their values along to the chosen engine.

In the case of linear regressions, some engines such as glmnet are prepared to receive the hyperparameters needed to implement the Elastic Net lasso-ridge regression.

lr_mod <- 
  linear_reg(penalty = .02, mixture = 1) %>% 
  set_engine("glmnet")
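
In this parameterization, penalty corresponds to glmnet's lambda and mixture to its alpha: mixture = 1 is pure lasso, mixture = 0 pure ridge. The penalty term glmnet combines can be written directly (a sketch for illustration, with invented coefficients):

```r
# Elastic net penalty as parameterized by glmnet:
# lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|))
elastic_net_penalty <- function(beta, penalty, mixture) {
  penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
}

beta <- c(0.5, -1, 2)
elastic_net_penalty(beta, penalty = 0.02, mixture = 1)  # pure lasso
elastic_net_penalty(beta, penalty = 0.02, mixture = 0)  # pure ridge
```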

wf_com_interacao <- workflow() %>% 
  add_recipe(receita_com_interacao) %>% 
  add_model(lr_mod) 

wf_com_interacao
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 12 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_zv()
## * ...
## * and 2 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.02
##   mixture = 1
## 
## Computational engine: glmnet

Fitting with Elastic Net

In the Elastic Net estimation we end up with fewer active predictors.

fit_com_interacao <- 
  wf_com_interacao %>% 
  fit(
    data = dado_treino_regressao
  )

fit_com_interacao %>%  tidy() %>% 
  filter(
    estimate != 0
  ) %>% 
  arrange(
    estimate %>% abs() %>% desc()
  ) %>% 
  select(
    term,
    estimate
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = one_of("estimate"),
    decimals = 3
  )
term estimate
i_found_partner__attractive 0.290
probability_i_find_partner_liked_me 0.147
i_found_partner__interests 0.113
i_found_partner__fun 0.100
i_found_partner__attractive_x_sex_Homem 0.081
you_look_for__attractive −0.055
partner_found_me__attractive −0.055
race_White −0.040
you_perceive_yourself__attractive −0.033
probability_partner_find_i_liked_partner 0.033
you_perceive_yourself__fun −0.029
you_perceive_yourself__sincere −0.029
career_macro_Politics −0.028
importance_same_race −0.021
goal_Other −0.021
you_look_for__shared_interests 0.021
partner_career_macro_Speech.Pathology −0.021
goal_Serious 0.021
i_found_partner__sincere −0.021
career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Creative.Arts.Entertainment −0.020
career_macro_Pro.sports.Athletics 0.020
happy_expec 0.019
i_found_partner__ambitious −0.017
partner_liked_me −0.016
career_macro_Psychologist_x_partner_career_macro_Architecture 0.015
importance_same_religion −0.015
career_macro_Lawyer_x_partner_career_macro_Creative.Arts.Entertainment −0.014
race_Latino_x_importance_same_race −0.014
you_look_for__fun 0.012
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Creative.Arts.Entertainment −0.011
same_race 0.011
career_macro_Social.Work 0.011
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Engineer −0.010
goal_Meet.new.people 0.009
opinion_duration_of_date_Too.much 0.009
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.008
partner_career_macro_Lawyer 0.008
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Speech.Pathology 0.008
race_White_x_importance_same_race −0.008
partner_career_macro_Real.Estate −0.007
frequency_date −0.006
career_macro_Other_x_partner_career_macro_Lawyer −0.006
career_macro_Doctor.Medicine_x_partner_career_macro_International.Humanitarian.Affairs −0.006
partner_found_me__intelligent 0.006
career_macro_Other −0.006
career_macro_Engineer_x_partner_career_macro_Social.Work −0.005
career_macro_Undecided_x_partner_career_macro_Doctor.Medicine 0.005
partner_career_macro_Journalism 0.005
partner_race_Latino_x_importance_same_race −0.004
career_macro_Lawyer_x_partner_career_macro_Psychologist −0.004
partner_career_macro_Psychologist −0.003
career_macro_Creative.Arts.Entertainment −0.003
partner_found_me__sincere 0.003
career_macro_Lawyer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin −0.003
career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Speech.Pathology −0.002
career_macro_Engineer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.002
n_people_met_in_wave 0.002
you_look_for__ambitious −0.002
career_macro_Psychologist_x_partner_career_macro_Psychologist 0.002
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Engineer −0.001
career_macro_Social.Work_x_partner_career_macro_Other −0.001
career_macro_International.Humanitarian.Affairs_x_partner_career_macro_International.Humanitarian.Affairs −0.001
race_Others_x_partner_race_Latino 0.001
race_Latino_x_partner_race_Latino_x_importance_same_race −0.000
partner_stated_pref_time0_attractive 0.000
career_macro_Lawyer_x_partner_career_macro_Other 0.000
position_meeting 0.000
career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin 0.000
career_macro_Journalism 0.000
partner_career_macro_Social.Work −0.000
(Intercept) −0.000

More preprocessing steps

Here we add a few more preprocessing steps: a derived age-difference predictor with a polynomial basis (step_mutate and step_poly), many more interaction terms, and a filter that drops highly correlated predictors (step_corr):

receita_com_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender,
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  ) %>%
  step_mutate(
    diff_age = my_age - partner_age     
  ) %>% 
  step_poly(
    diff_age
  ) %>% 
  step_interact(
    terms = ~ starts_with("diff_age") * starts_with("sex")
  ) %>% 
  step_interact(
    terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
  ) %>% 
  step_interact(
    terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
  ) %>%  
  step_interact(
    terms = ~ starts_with("interests_correlation")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_age")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_found_me__")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_look_for__")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("you_look_for__")*starts_with("i_found_partner")
  ) %>%
  step_interact(
    terms = ~ starts_with("probability_partner_find_i_liked_partner")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("career_")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("partner_career_")*starts_with("sex")
  ) %>%
  step_interact(
    terms = ~ starts_with("goal")*starts_with("sex")
  ) %>%
  step_corr(
    all_predictors(), 
    threshold = 0.8  
  ) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  
  
log_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

wf_com_interacao_class <- workflow() %>% 
  add_recipe(receita_com_interacao_class) %>% 
  add_model(log_mod)


fit_com_interacao_class <- 
  wf_com_interacao_class %>% 
  fit(
    data = dado_treino
  )

Evaluating the ROC curve with yardstick

The ROC (Receiver Operating Characteristic) curve was invented around the time of the Second World War, to evaluate whether American radar operators were reliably detecting Japanese aircraft from radar signals.

The curve shows, for several thresholds, the fraction of true positives (the sensitivity) and the fraction of false positives (the fall-out, or \(1 - specificity\)).

A numeric metric that summarizes the overall discriminative performance of a classification model is the area under this curve (AUC, Area Under the Curve). Note that the closer this area is to 1, the lower the false positive rate and the higher the sensitivity.
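
For a tiny invented example, sensitivity, fall-out and the AUC can all be computed by hand, which makes explicit what roc_curve() and roc_auc() compute:

```r
# Invented scores and labels (1 = Liked)
truth <- c(1, 1, 1, 0, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.2)

# Sensitivity and fall-out at a single threshold
threshold   <- 0.5
pred        <- as.integer(score >= threshold)
sensitivity <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # TPR
fall_out    <- sum(pred == 1 & truth == 0) / sum(truth == 0)  # FPR

# AUC via the rank (Mann-Whitney) formulation: the probability that
# a random positive case scores higher than a random negative one
pos <- score[truth == 1]
neg <- score[truth == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```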

Here we run the prediction on the test data and use yardstick's roc_curve() function to generate the data needed to plot the curve.

pred_like <- predict(
  object = fit_com_interacao_class,
  new_data = dado_teste,
  type = "prob"
) %>% 
  bind_cols(dado_teste %>%  select(i_liked_partner))


dados_roc <- pred_like %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  )

dados_roc %>% 
  filter(
    row_number() %% 100 == 0
  ) %>% 
  gt() %>% 
  fmt_number(
    columns = everything(),
    decimals = 3
  )
.threshold specificity sensitivity
0.015 0.137 0.989
0.048 0.277 0.978
0.109 0.409 0.958
0.188 0.543 0.940
0.297 0.656 0.896
0.446 0.751 0.831
0.542 0.833 0.749
0.669 0.896 0.643
0.787 0.936 0.510
0.871 0.967 0.366
0.939 0.988 0.209
0.991 0.999 0.040

We then plot the curve:

ponto_gatilho <- dados_roc %>% 
  filter(
    .threshold > 0.5
  ) %>% 
  slice_min(
    n = 1, order_by = .threshold
  )

dados_roc %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_point(
    data = ponto_gatilho,
    aes(x = 1 - specificity, y = sensitivity),
    size = 3,
    color = "darkblue"
  ) +
  geom_text_repel(
    data = ponto_gatilho,
    aes(
      x = 1 - specificity + 0.15, 
      y = sensitivity - 0.15,
      label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>%  percent(accuracy = 0.1)}")
    )
  ) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

There is a function that plots the curve automatically:

dados_roc %>% 
  autoplot()

The roc_auc() function computes the area under the curve:

pred_like %>% 
  roc_auc(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  gt()
.metric .estimator .estimate
roc_auc binary 0.8686356

Running without interactions

receita_sem_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
  update_role(
    match,
    unique_id_number,
    id_within_wave,
    subject_within_gender, 
    partnet_id_within_wave,
    partner_unique_id_number,
    new_role = "ID"
  ) %>% 
  step_ordinalscore(
    frequency_date
  ) %>% 
  step_dummy(
    all_nominal(), -frequency_date, -all_outcomes() 
  )

wf_sem_interacao_class <- workflow() %>% 
  add_recipe(receita_sem_interacao_class) %>% 
  add_model(log_mod)


fit_sem_interacao_class <- 
  wf_sem_interacao_class %>% 
  fit(
    data = dado_treino
  )

pred_like_sem <- predict(
  object = fit_sem_interacao_class,
  new_data = dado_teste,
  type = "prob"
) %>% 
  bind_cols(dado_teste %>%  select(i_liked_partner))



pred_like_sem %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  autoplot()

pred_like_sem %>% 
  roc_auc(
    truth = i_liked_partner,
    .pred_Liked
  ) %>% 
  gt()
.metric .estimator .estimate
roc_auc binary 0.8669198

Comparing the curves

The curves came out very similar.

Adding the interactions did not help much, but that may be due to the simplicity of the model.

rocs <- bind_rows(
  roc_curve(pred_like_sem, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Sem interação"),
  roc_curve(pred_like, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Com interação")
)
  




rocs %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = tipo)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw() +
  theme(
    legend.position = "top"
  )

Implementing cross-validation

The rsample package, which we have already seen, provides the infrastructure to generate the resamples needed for cross-validation.

As a reminder, this is the scheme we need to set up:

In the code below we repeat a 5-fold split twice. We therefore end up with 10 train/validation partitions.
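
The bookkeeping behind vfold_cv() can be sketched in base R: for each repeat, shuffle the rows into v folds, and let each fold in turn play the role of the validation set (a sketch on made-up row indices, ignoring stratification):

```r
set.seed(123)
n <- 20; v <- 5; repeats <- 2

splits <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(v), length.out = n))  # random fold labels
  for (f in seq_len(v)) {
    splits[[length(splits) + 1]] <- list(
      analysis   = which(fold_id != f),  # training part
      assessment = which(fold_id == f)   # validation part
    )
  }
}

length(splits)  # v * repeats = 10 partitions
```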

folds <- vfold_cv(dado_treino, v = 5, repeats = 2, strata = i_liked_partner)

folds
## #  5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 3
##    splits             id      id2  
##    <list>             <chr>   <chr>
##  1 <split [2.9K/734]> Repeat1 Fold1
##  2 <split [2.9K/734]> Repeat1 Fold2
##  3 <split [2.9K/732]> Repeat1 Fold3
##  4 <split [2.9K/732]> Repeat1 Fold4
##  5 <split [2.9K/732]> Repeat1 Fold5
##  6 <split [2.9K/734]> Repeat2 Fold1
##  7 <split [2.9K/734]> Repeat2 Fold2
##  8 <split [2.9K/732]> Repeat2 Fold3
##  9 <split [2.9K/732]> Repeat2 Fold4
## 10 <split [2.9K/732]> Repeat2 Fold5
analysis(x = folds$splits[[1]] ) %>%  glimpse()
## Rows: 2,930
## Columns: 64
## $ sex                                        <fct> Mulher, Mulher, Mulher, ...
## $ choice                                     <fct> limited, limited, limite...
## $ partner_race                               <fct> Asian, White, White, Whi...
## $ frequency_date                             <ord> Once a month, Once a mon...
## $ career_macro                               <fct> Lawyer, Lawyer, Lawyer, ...
## $ opinion_duration_of_date                   <fct> Just Right, Just Right, ...
## $ race                                       <fct> White, White, White, Whi...
## $ goal                                       <fct> Fun, Fun, Fun, Fun, Fun,...
## $ match                                      <dbl> 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ same_race                                  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ partner_liked_me                           <dbl> 1, 1, 1, 1, 0, 1, 0, 0, ...
## $ i_liked_partner                            <fct> Not, Liked, Not, Not, No...
## $ met_before                                 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ unique_id_number                           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ id_within_wave                             <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ subject_within_gender                      <dbl> 7, 7, 7, 7, 7, 7, 7, 9, ...
## $ n_people_met_in_wave                       <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting                           <dbl> 6, 6, 6, 6, 6, 6, 6, 4, ...
## $ order_meeting                              <dbl> 9, 4, 5, 10, 1, 7, 8, 1,...
## $ partnet_id_within_wave                     <dbl> 3, 4, 6, 7, 8, 9, 10, 1,...
## $ partner_unique_id_number                   <dbl> 13, 14, 16, 17, 18, 19, ...
## $ interests_correlation                      <dbl> 0.05, -0.18, 0.37, 0.35,...
## $ my_age                                     <dbl> 23, 23, 23, 23, 23, 23, ...
## $ partner_age                                <dbl> 22, 23, 25, 30, 27, 28, ...
## $ partner_stated_pref_time0_attractive       <dbl> 0.1900000, 0.3000000, 0....
## $ partner_stated_pref_time0_sincere          <dbl> 0.1800000, 0.0500000, 0....
## $ partner_stated_pref_time0_intelligent      <dbl> 0.1900000, 0.1500000, 0....
## $ partner_stated_pref_time0_fun              <dbl> 0.1800000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious        <dbl> 0.1400000, 0.0500000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.1200000, 0.0500000, 0....
## $ importance_same_race                       <dbl> 1, 1, 1, 1, 1, 1, 1, 8, ...
## $ importance_same_religion                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec                                <dbl> 1, 1, 1, 1, 1, 1, 1, 7, ...
## $ i_found_partner__attractive                <dbl> 4, 8, 5, 7, 5, 9, 8, 5, ...
## $ i_found_partner__sincere                   <dbl> 7, 10, 10, 10, 9, 8, 9, ...
## $ i_found_partner__intelligent               <dbl> 8, 7, 8, 10, 9, 10, 10, ...
## $ i_found_partner__fun                       <dbl> 8, 10, 4, 7, 5, 10, 10, ...
## $ i_found_partner__ambitious                 <dbl> 6, 7, 8, 10, 9, 7, 8, 2,...
## $ i_found_partner__interests                 <dbl> 7, 10, 2, 5, 7, 8, 8, 2,...
## $ partner_found_me__attractive               <dbl> 10, 7, 6, 7, 6, 7, 7, 6,...
## $ partner_found_me__sincere                  <dbl> 10, 7, 6, 6, 7, 7, 8, 8,...
## $ partner_found_me__intelligent              <dbl> 10, 7, 7, 3, 8, 7, 8, 8,...
## $ partner_found_me__fun                      <dbl> 10, 9, 7, 5, 6, 10, 7, 8...
## $ partner_found_me__ambitious                <dbl> 10, 9, 8, 6, 6, 9, 8, 7,...
## $ partner_found_me__interests                <dbl> 10, 9, 7, 5, 5, 10, 7, 6...
## $ probability_i_find_partner_liked_me        <dbl> 1, 10, 3, 1, 6, 8, 8, 5,...
## $ you_look_for__attractive                   <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__sincere                      <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent                  <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun                          <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__ambitious                    <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_look_for__shared_interests             <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_perceive_yourself__attractive          <dbl> 7, 7, 7, 7, 7, 7, 7, 6, ...
## $ you_perceive_yourself__sincere             <dbl> 8, 8, 8, 8, 8, 8, 8, 3, ...
## $ you_perceive_yourself__fun                 <dbl> 9, 9, 9, 9, 9, 9, 9, 6, ...
## $ you_perceive_yourself__intelligent         <dbl> 7, 7, 7, 7, 7, 7, 7, 10,...
## $ you_perceive_yourself__ambitious           <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ probability_partner_find_i_liked_partner   <dbl> 10, 10, 6, 4, 7, 8, 7, 5...
## $ partner_career_macro                       <fct> Lawyer, Lawyer, Banking/...
## $ partner_perceives_himself__attractive      <dbl> 4, 9, 6, 7, 6, 10, 7, 8,...
## $ partner_perceives_himself__sincere         <dbl> 7, 9, 6, 7, 8, 6, 7, 9, ...
## $ partner_perceives_himself__fun             <dbl> 8, 9, 8, 6, 6, 10, 10, 7...
## $ partner_perceives_himself__intelligent     <dbl> 8, 9, 8, 8, 8, 10, 10, 8...
## $ partner_perceives_himself__ambitious       <dbl> 3, 9, 6, 4, 9, 10, 10, 5...

Running cross-validation

Running cross-validation, as we saw, means fitting the model several times in runs that are completely independent, which makes it perfect for parallel execution.

The fit_resamples() function performs all the fits.

It is possible to register a backend to run the fits in parallel:

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

log_mod <- 
  logistic_reg() %>% 
  set_engine("glm")

wf_com_interacao_class <- workflow() %>% 
  add_recipe(receita_com_interacao_class) %>% 
  add_model(log_mod)

fit_com_interacao_resample <- 
  wf_com_interacao_class %>% 
  fit_resamples(
    folds,
    control = control_resamples(
      allow_par = TRUE 
    )
  )

stopCluster(cl)

fit_com_interacao_resample
## # Resampling results
## # 5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 5
##    splits             id      id2   .metrics         .notes          
##    <list>             <chr>   <chr> <list>           <list>          
##  1 <split [2.9K/734]> Repeat1 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
##  2 <split [2.9K/734]> Repeat1 Fold2 <tibble [2 x 4]> <tibble [2 x 1]>
##  3 <split [2.9K/732]> Repeat1 Fold3 <tibble [2 x 4]> <tibble [3 x 1]>
##  4 <split [2.9K/732]> Repeat1 Fold4 <tibble [2 x 4]> <tibble [2 x 1]>
##  5 <split [2.9K/732]> Repeat1 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>
##  6 <split [2.9K/734]> Repeat2 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
##  7 <split [2.9K/734]> Repeat2 Fold2 <tibble [2 x 4]> <tibble [3 x 1]>
##  8 <split [2.9K/732]> Repeat2 Fold3 <tibble [2 x 4]> <tibble [2 x 1]>
##  9 <split [2.9K/732]> Repeat2 Fold4 <tibble [2 x 4]> <tibble [3 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>

The collect_metrics() function takes the object returned by fit_resamples() and returns a structure with the results of the runs, 10 in our case. Since there are no hyperparameters to vary, there are only two rows, with two metrics for the same single hyperparameter configuration.
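
The aggregation performed by collect_metrics() is simply the mean over the resamples and its standard error, the standard deviation divided by the square root of the number of resamples. With invented per-fold accuracies:

```r
# One accuracy value per resample (10 here; values invented)
accuracies <- c(0.765, 0.772, 0.760, 0.771, 0.768,
                0.763, 0.774, 0.766, 0.769, 0.761)

n        <- length(accuracies)
mean_acc <- mean(accuracies)
std_err  <- sd(accuracies) / sqrt(n)  # standard error of the mean
c(mean = mean_acc, n = n, std_err = std_err)
```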

collect_metrics(fit_com_interacao_resample) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  )
.metric .estimator mean n std_err .config
accuracy binary 76.69% 10 0.28% Preprocessor1_Model1
roc_auc binary 84.66% 10 0.33% Preprocessor1_Model1

Tuning the hyperparameters

The logistic regression model we ran earlier had no hyperparameters, but when running models such as Elastic Net (lasso-ridge) we would like to find the best hyperparameter configuration on the validation data.

The tune package helps us with this procedure.

When defining the model, we assign the result of tune() to the model parameters we want to vary for tuning purposes:

tune_spec_logistic_reg <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>% 
  set_engine(
    engine = "glmnet"
  ) %>% 
  set_mode(
    "classification"
  )


tune_spec_logistic_reg
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = tune()
## 
## Computational engine: glmnet

Defining the values to be tested

One way to run this search for the best hyperparameter configuration is to create a grid with several specifications.

The dials package helps create this grid of hyperparameter values.

The grid_regular() function creates a grid with regular intervals (depending on the parameter, these intervals may be regular on a log scale, for example).

There are functions with the same names as the parameters in parsnip's generic interface. Called without arguments, these functions generate values that usually make sense, but it is possible to choose the values manually.

The levels parameter defines how many distinct values will be used for each parameter.
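
For penalty, "regular" means regular on the log10 scale. Assuming the range c(-4, -1) used in the next chunk, the ten values can be reproduced in base R and shown to differ by a constant multiplicative factor:

```r
# penalty(range = c(-4, -1)) works on the log10 scale, so 10 levels are:
penalty_values <- 10^seq(-4, -1, length.out = 10)
penalty_values

# Consecutive values share the same ratio: regular in log space
ratios <- penalty_values[-1] / penalty_values[-length(penalty_values)]
```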

net_grid <- grid_regular(
  penalty(range = c(-4,-1)),
  mixture(),
  levels = c(
    penalty = 10,
    mixture = 10
  )
)

net_grid
## # A tibble: 100 x 2
##     penalty mixture
##       <dbl>   <dbl>
##  1 0.0001         0
##  2 0.000215       0
##  3 0.000464       0
##  4 0.001          0
##  5 0.00215        0
##  6 0.00464        0
##  7 0.01           0
##  8 0.0215         0
##  9 0.0464         0
## 10 0.1            0
## # ... with 90 more rows

Running the tuning

The tune_grid() function runs the search within this grid of parameter values, running the cross-validation process according to what is passed to the resamples parameter.

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


wf_logreg_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_logistic_reg
  ) %>% 
  add_recipe(receita_com_interacao_class)


res_logreg_tune_sample_optim <- wf_logreg_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = net_grid,
    control = control_grid(allow_par = TRUE)
  )  

stopCluster(cl)

res_logreg_tune_sample_optim
## # Tuning results
## # 5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 x 5
##    splits             id      id2   .metrics           .notes          
##    <list>             <chr>   <chr> <list>             <list>          
##  1 <split [2.9K/734]> Repeat1 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
##  2 <split [2.9K/734]> Repeat1 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
##  3 <split [2.9K/732]> Repeat1 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
##  4 <split [2.9K/732]> Repeat1 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
##  5 <split [2.9K/732]> Repeat1 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
##  6 <split [2.9K/734]> Repeat2 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
##  7 <split [2.9K/734]> Repeat2 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
##  8 <split [2.9K/732]> Repeat2 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
##  9 <split [2.9K/732]> Repeat2 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>

Visualizing the results

One way to visualize the results is with geom_tile, after running collect_metrics():

plot_result_tune <- function(results){
  results %>% 
    collect_metrics() %>%
    filter(
      .metric == "roc_auc"
    ) %>% 
    mutate(
      ranque = rank(mean)
    ) %>%     
    ggplot() +
    geom_tile(
      aes(
        x = penalty,
        y = mixture,
        fill = ranque
      )
    ) +
    geom_shadowtext(
      aes(
        x = penalty,
        y = mixture,
        label = percent(mean, accuracy = .01)
      ),
      size = 3,
      color = "white",
      bg.colour = "black"
    ) +
    scale_x_log10() +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    theme_minimal() +
    theme(
      legend.position = "top"
    )
}


plot_result_tune(res_logreg_tune_sample_optim)

res_logreg_tune_sample_optim %>% 
  collect_metrics() %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  arrange(mean %>%  desc()) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) %>% 
  fmt_number(
    columns = vars(penalty, mixture),
    n_sigfig = 2
  )
penalty mixture .metric .estimator mean n std_err .config
0.010 0.33 roc_auc binary 85.32% 10 0.30% Preprocessor1_Model037
0.0046 0.67 roc_auc binary 85.30% 10 0.31% Preprocessor1_Model066
0.0046 0.78 roc_auc binary 85.30% 10 0.31% Preprocessor1_Model076
0.010 0.44 roc_auc binary 85.29% 10 0.29% Preprocessor1_Model047
0.010 0.22 roc_auc binary 85.29% 10 0.31% Preprocessor1_Model027
0.0046 0.56 roc_auc binary 85.29% 10 0.32% Preprocessor1_Model056
0.0046 0.89 roc_auc binary 85.28% 10 0.30% Preprocessor1_Model086
0.0022 1.0 roc_auc binary 85.28% 10 0.33% Preprocessor1_Model095
0.0046 0.44 roc_auc binary 85.26% 10 0.32% Preprocessor1_Model046
0.0046 1.0 roc_auc binary 85.25% 10 0.30% Preprocessor1_Model096
0.0022 0.89 roc_auc binary 85.25% 10 0.33% Preprocessor1_Model085
0.022 0.11 roc_auc binary 85.24% 10 0.29% Preprocessor1_Model018
0.0022 0.78 roc_auc binary 85.23% 10 0.33% Preprocessor1_Model075
0.022 0.22 roc_auc binary 85.21% 10 0.28% Preprocessor1_Model028
0.010 0.56 roc_auc binary 85.21% 10 0.28% Preprocessor1_Model057
0.0046 0.33 roc_auc binary 85.21% 10 0.32% Preprocessor1_Model036
0.0022 0.67 roc_auc binary 85.19% 10 0.33% Preprocessor1_Model065
0.010 0.11 roc_auc binary 85.18% 10 0.31% Preprocessor1_Model017
0.0010 1.0 roc_auc binary 85.16% 10 0.33% Preprocessor1_Model094
0.0022 0.56 roc_auc binary 85.15% 10 0.33% Preprocessor1_Model055
0.0010 0.89 roc_auc binary 85.14% 10 0.32% Preprocessor1_Model084
0.0046 0.22 roc_auc binary 85.13% 10 0.32% Preprocessor1_Model026
0.0010 0.78 roc_auc binary 85.13% 10 0.32% Preprocessor1_Model074
0.010 0.67 roc_auc binary 85.12% 10 0.28% Preprocessor1_Model067
0.0022 0.44 roc_auc binary 85.12% 10 0.32% Preprocessor1_Model045
0.0010 0.67 roc_auc binary 85.11% 10 0.31% Preprocessor1_Model064
0.0022 0.33 roc_auc binary 85.09% 10 0.32% Preprocessor1_Model035
0.0010 0.56 roc_auc binary 85.09% 10 0.31% Preprocessor1_Model054
0.0046 0.11 roc_auc binary 85.07% 10 0.32% Preprocessor1_Model016
0.00010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model001
0.00022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model002
0.00046 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model003
0.0010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model004
0.0022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model005
0.0046 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model006
0.010 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model007
0.022 0 roc_auc binary 85.07% 10 0.30% Preprocessor1_Model008
0.0022 0.22 roc_auc binary 85.06% 10 0.31% Preprocessor1_Model025
0.0010 0.44 roc_auc binary 85.06% 10 0.31% Preprocessor1_Model044
0.046 0.11 roc_auc binary 85.04% 10 0.26% Preprocessor1_Model019
0.022 0.33 roc_auc binary 85.04% 10 0.26% Preprocessor1_Model038
0.046 0 roc_auc binary 85.04% 10 0.29% Preprocessor1_Model009
0.00046 1.0 roc_auc binary 85.03% 10 0.30% Preprocessor1_Model093
0.0010 0.33 roc_auc binary 85.02% 10 0.30% Preprocessor1_Model034
0.010 0.78 roc_auc binary 85.01% 10 0.27% Preprocessor1_Model077
0.0022 0.11 roc_auc binary 85.01% 10 0.31% Preprocessor1_Model015
0.00046 0.89 roc_auc binary 85.01% 10 0.30% Preprocessor1_Model083
0.00046 0.78 roc_auc binary 85.00% 10 0.30% Preprocessor1_Model073
0.0010 0.22 roc_auc binary 84.99% 10 0.30% Preprocessor1_Model024
0.00046 0.67 roc_auc binary 84.98% 10 0.30% Preprocessor1_Model063
0.00046 0.56 roc_auc binary 84.96% 10 0.30% Preprocessor1_Model053
0.00046 0.44 roc_auc binary 84.94% 10 0.30% Preprocessor1_Model043
0.0010 0.11 roc_auc binary 84.93% 10 0.31% Preprocessor1_Model014
0.00022 1.0 roc_auc binary 84.91% 10 0.30% Preprocessor1_Model092
0.00046 0.33 roc_auc binary 84.91% 10 0.30% Preprocessor1_Model033
0.00022 0.89 roc_auc binary 84.90% 10 0.30% Preprocessor1_Model082
0.010 0.89 roc_auc binary 84.90% 10 0.26% Preprocessor1_Model087
0.00022 0.78 roc_auc binary 84.89% 10 0.30% Preprocessor1_Model072
0.00046 0.22 roc_auc binary 84.88% 10 0.31% Preprocessor1_Model023
0.00022 0.67 roc_auc binary 84.87% 10 0.30% Preprocessor1_Model062
0.00022 0.56 roc_auc binary 84.85% 10 0.30% Preprocessor1_Model052
0.00046 0.11 roc_auc binary 84.85% 10 0.31% Preprocessor1_Model013
0.10 0 roc_auc binary 84.85% 10 0.28% Preprocessor1_Model010
0.00010 1.0 roc_auc binary 84.84% 10 0.31% Preprocessor1_Model091
0.00022 0.44 roc_auc binary 84.84% 10 0.30% Preprocessor1_Model042
0.00022 0.33 roc_auc binary 84.83% 10 0.30% Preprocessor1_Model032
0.022 0.44 roc_auc binary 84.82% 10 0.25% Preprocessor1_Model048
0.00022 0.22 roc_auc binary 84.81% 10 0.31% Preprocessor1_Model022
0.00010 0.89 roc_auc binary 84.81% 10 0.30% Preprocessor1_Model081
0.00010 0.78 roc_auc binary 84.81% 10 0.30% Preprocessor1_Model071
0.00010 0.11 roc_auc binary 84.80% 10 0.31% Preprocessor1_Model011
0.00022 0.11 roc_auc binary 84.80% 10 0.31% Preprocessor1_Model012
0.00010 0.67 roc_auc binary 84.79% 10 0.31% Preprocessor1_Model061
0.00010 0.56 roc_auc binary 84.79% 10 0.31% Preprocessor1_Model051
0.00010 0.44 roc_auc binary 84.78% 10 0.31% Preprocessor1_Model041
0.010 1.0 roc_auc binary 84.78% 10 0.26% Preprocessor1_Model097
0.00010 0.33 roc_auc binary 84.77% 10 0.31% Preprocessor1_Model031
0.00010 0.22 roc_auc binary 84.77% 10 0.31% Preprocessor1_Model021
0.046 0.22 roc_auc binary 84.61% 10 0.24% Preprocessor1_Model029
0.022 0.56 roc_auc binary 84.56% 10 0.23% Preprocessor1_Model058
0.10 0.11 roc_auc binary 84.33% 10 0.23% Preprocessor1_Model020
0.022 0.67 roc_auc binary 84.33% 10 0.22% Preprocessor1_Model068
0.046 0.33 roc_auc binary 84.20% 10 0.21% Preprocessor1_Model039
0.022 0.78 roc_auc binary 84.10% 10 0.21% Preprocessor1_Model078
0.022 0.89 roc_auc binary 83.85% 10 0.20% Preprocessor1_Model088
0.046 0.44 roc_auc binary 83.80% 10 0.19% Preprocessor1_Model049
0.022 1.0 roc_auc binary 83.61% 10 0.19% Preprocessor1_Model098
0.10 0.22 roc_auc binary 83.58% 10 0.21% Preprocessor1_Model030
0.046 0.56 roc_auc binary 83.41% 10 0.18% Preprocessor1_Model059
0.046 0.67 roc_auc binary 83.05% 10 0.17% Preprocessor1_Model069
0.10 0.33 roc_auc binary 82.99% 10 0.19% Preprocessor1_Model040
0.046 0.78 roc_auc binary 82.67% 10 0.17% Preprocessor1_Model079
0.10 0.44 roc_auc binary 82.57% 10 0.18% Preprocessor1_Model050
0.046 0.89 roc_auc binary 82.32% 10 0.18% Preprocessor1_Model089
0.10 0.56 roc_auc binary 82.22% 10 0.19% Preprocessor1_Model060
0.046 1.0 roc_auc binary 82.13% 10 0.19% Preprocessor1_Model099
0.10 0.67 roc_auc binary 81.83% 10 0.22% Preprocessor1_Model070
0.10 0.78 roc_auc binary 81.57% 10 0.23% Preprocessor1_Model080
0.10 0.89 roc_auc binary 81.29% 10 0.24% Preprocessor1_Model090
0.10 1.0 roc_auc binary 80.90% 10 0.23% Preprocessor1_Model100

Árvore de decisão

A árvore de decisão particiona o espaço formado pelas variáveis explicativas em subespaços baseando-se na “pureza” desses subespaços com relação à variável dependente.
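Para ilustrar a ideia de "pureza": o critério padrão do rpart para classificação é a impureza de Gini. Abaixo, um esboço didático (função hipotética, não faz parte do workflow do curso):

```r
# Impureza de Gini de um nó: 1 - soma(p_k^2), onde p_k é a
# proporção de cada classe no nó. Quanto menor, mais "puro" o nó.
gini <- function(classes) {
  p <- table(classes) / length(classes)
  1 - sum(p^2)
}

gini(c("Liked", "Liked", "Liked", "Liked"))  # nó puro: 0
gini(c("Liked", "Liked", "Not", "Not"))      # nó misto: 0.5
```

A cada particionamento, o algoritmo escolhe a variável e o ponto de corte que mais reduzem essa impureza nos subespaços resultantes.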

Abaixo fazemos um experimento com apenas duas features contínuas e uma feature categórica com dois valores possíveis.

receita_arvore_decisao_demo <- recipe(
  i_liked_partner ~
    i_found_partner__attractive +
    sex +
    my_age,
   data = dado_treino
)

arvore_decisao_mod <- 
  decision_tree(
    tree_depth = 4,
    min_n = 1,
    cost_complexity = 0
    
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

wf_arvore_decisao_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(arvore_decisao_mod)

fit_ad_demo <- 
  wf_arvore_decisao_demo %>% 
  fit(
    data = dado_treino
  )

O algoritmo cria uma árvore de decisão como esta.

Ele escolhe, portanto, a ordem e os valores dos atributos que fatiarão a população. Seguindo a árvore até suas folhas, que são os nós sem filhos, podemos determinar a saída prevista para cada valor do vetor \(x\).

fit_ad_demo$fit$fit$fit %>% rpart.plot() 

Temos um espaço formado por duas features contínuas e uma feature categórica com duas categorias; podemos pensar este espaço como dois planos, um para cada categoria.

O algoritmo da árvore de decisão escolhe qual feature divide cada plano em duas partes mais puras no sentido da classificação (com mais pontos com a mesma categoria).

Como temos duas features contínuas e mais uma com duas categorias possíveis, isso equivale a dividir o espaço, representado pelos dois planos, com retas verticais e horizontais.

valores_partner_atractive = tibble(i_found_partner__attractive = seq(0, 10, by = 0.1))

valores_my_age = 
  tibble(
    my_age = seq(min(dado_treino$my_age), max(dado_treino$my_age), by = 1 ) 
  )

valores_sex = tibble( sex = c("Homem", "Mulher")  )



dados_novos <- crossing(
  valores_partner_atractive,
  valores_my_age,
  valores_sex
)

predicoes_arvore <- predict(
  object = fit_ad_demo,
  new_data = dados_novos
)


predicoes_arvore_com_dados <- bind_cols(
  dados_novos,
  predicoes_arvore
) 


ggplot(predicoes_arvore_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  geom_vline(
    xintercept = 6.75
  ) +
  geom_vline(
    xintercept = 7.25
  ) +
  geom_hline(
    data = tibble(
      my_age = 20.5,
      sex = "Homem"
    ), 
    aes(
      yintercept = my_age
    )
  ) +
  geom_hline(
    data = tibble(
      my_age = 28.5,
      sex = "Mulher"
    ), 
    aes(
      yintercept = my_age
    )
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

log_reg_mod <- 
  logistic_reg(
  ) %>% 
  set_engine("glm") %>% 
  set_mode("classification")

wf_log_reg_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(log_reg_mod)

fit_log_reg_demo <- 
  wf_log_reg_demo %>% 
  fit(
    data = dado_treino
  )

A título de comparação, podemos ver o mesmo esquema com a regressão logística.

predicoes_log_reg <- predict(
  object = fit_log_reg_demo,
  new_data = dados_novos
)


predicoes_log_reg_com_dados <- bind_cols(
  dados_novos,
  predicoes_log_reg
) 



ggplot(predicoes_log_reg_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Random Forest

A forma como a árvore de decisão é criada faz com que ela tenha muita variância.

Cada decisão de particionamento é tomada a partir de características que podem ser muito específicas aos dados de treinamento.

Uma ideia usada nas Random Forests é criar \(trees\) conjuntos de treinamento retirando do conjunto original amostras de mesmo tamanho, com reposição (bootstrap). Além disso, a cada particionamento de uma árvore, a divisão só pode considerar \(mtry\) das variáveis explicativas, sorteadas ao acaso.

O resultado final é uma média da decisão dessas \(trees\) árvores.

Estas duas mudanças fazem com que o modelo tenha uma variância muito menor do que a da árvore de decisão simples.
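Podemos ilustrar essa redução de variância com uma simulação simples. O esboço abaixo assume estimativas independentes; na prática as árvores são correlacionadas, e é justamente o sorteio de \(mtry\) variáveis que ajuda a descorrelacioná-las:

```r
set.seed(123)

# Previsão "ruidosa" de uma única árvore vs. média de 100 árvores:
# a média de B estimativas independentes tem variância B vezes menor.
uma_arvore   <- replicate(1000, rnorm(1))
floresta_100 <- replicate(1000, mean(rnorm(100)))

var(uma_arvore)    # perto de 1
var(floresta_100)  # perto de 0.01
```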

ranger_mod <- 
  rand_forest(
    mtry = 2,
    trees = 100,
    min_n = 1
  ) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

wf_ranger_demo <- workflow() %>% 
  add_recipe(receita_arvore_decisao_demo) %>% 
  add_model(ranger_mod)

fit_ranger_demo <- 
  wf_ranger_demo %>% 
  fit(
    data = dado_treino
  )

Random Forest: maior flexibilidade

Abaixo podemos ver que o modelo oferece muito mais flexibilidade que as árvores de decisão simples.

predicoes_ranger <- predict(
  object = fit_ranger_demo,
  new_data = dados_novos
)


predicoes_ranger_com_dados <- bind_cols(
  dados_novos,
  predicoes_ranger
) 



ggplot(predicoes_ranger_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Random Forest: montando o tuning

Abaixo montamos a nova configuração de modelo, com seus hiperparâmetros a serem tunados.

tune_spec_rand_forest <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) %>% 
  set_engine(
    engine = "ranger"
  ) %>% 
  set_mode(
    "classification"
  )

tune_spec_rand_forest
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger

Fazendo o tuning da random forest

Vamos criar um primeiro grid para tentar otimizar o valor dos hiperparâmetros.

rand_for_grid <- grid_regular(
  mtry(range = c(3,50)),
  trees(range = c(1,100)),
  min_n(),
  levels = 4
)

rand_for_grid
## # A tibble: 64 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     3     1     2
##  2    18     1     2
##  3    34     1     2
##  4    50     1     2
##  5     3    34     2
##  6    18    34     2
##  7    34    34     2
##  8    50    34     2
##  9     3    67     2
## 10    18    67     2
## # ... with 54 more rows
wf_rand_for_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_rand_forest
  ) %>% 
  add_recipe(receita_com_interacao_class)

all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for <- read_rds("resultados/res_rand_for.rds" )


collect_metrics(res_rand_for) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
18 100 2 roc_auc binary 88.03% 10 0.48% Preprocessor1_Model14
50 100 2 roc_auc binary 87.72% 10 0.49% Preprocessor1_Model16
34 100 2 roc_auc binary 87.67% 10 0.46% Preprocessor1_Model15
18 67 2 roc_auc binary 87.67% 10 0.47% Preprocessor1_Model10
18 100 14 roc_auc binary 87.60% 10 0.50% Preprocessor1_Model30
50 67 2 roc_auc binary 87.44% 10 0.49% Preprocessor1_Model12
34 100 14 roc_auc binary 87.43% 10 0.50% Preprocessor1_Model31
34 67 2 roc_auc binary 87.43% 10 0.44% Preprocessor1_Model11
18 67 14 roc_auc binary 87.41% 10 0.47% Preprocessor1_Model26
50 100 14 roc_auc binary 87.34% 10 0.43% Preprocessor1_Model32

Visualizando os primeiros resultados do tuning

plot_result_tune_ranger <- function(results){
  

  results %>% 
    map_df(
      .f = collect_metrics
    ) %>% 
    filter(
      .metric == "roc_auc"
    ) %>% 
    mutate(
      ranque = rank(mean)
    ) %>% 
    ggplot() +
    geom_tile(
      aes(
        x = mtry %>%  factor(),
        y = trees %>%  factor(),
        fill = ranque
      )
    ) +
    geom_shadowtext(
      aes(
        x = mtry %>% factor(),
        y = trees %>%  factor(),
        label = percent(mean, accuracy = .01),
      ),
      color = "white",
      bg.colour="black",
      size = 3,
    ) +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    facet_wrap(
      ~min_n, 
      ncol = 1, 
      labeller = as_labeller( function(x){str_glue("min_n: {x}")}  )
    ) +
    theme_minimal() +
    theme(
      legend.position = "top"
    ) +
    labs(
      x = "mtry",
      y = "trees"
       
    ) 
}

plot_result_tune_ranger(list(res_rand_for))

Segunda rodada de otimização

rand_for_grid_optim <- grid_regular(
  mtry(range = c(4,45)),
  trees(range = c(150,350)),
  min_n(c(1,6)),
  levels = 4
)





all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for_optim <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid_optim,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for_optim <- read_rds("resultados/res_rand_for_optim.rds" )


collect_metrics(res_rand_for_optim) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
17 283 1 roc_auc binary 88.29% 10 0.46% Preprocessor1_Model10
17 350 2 roc_auc binary 88.19% 10 0.45% Preprocessor1_Model30
17 350 1 roc_auc binary 88.18% 10 0.44% Preprocessor1_Model14
17 216 2 roc_auc binary 88.18% 10 0.45% Preprocessor1_Model22
17 216 1 roc_auc binary 88.13% 10 0.47% Preprocessor1_Model06
17 350 4 roc_auc binary 88.12% 10 0.46% Preprocessor1_Model46
17 283 2 roc_auc binary 88.12% 10 0.44% Preprocessor1_Model26
17 350 6 roc_auc binary 88.10% 10 0.43% Preprocessor1_Model62
17 283 4 roc_auc binary 88.09% 10 0.46% Preprocessor1_Model42
31 283 4 roc_auc binary 88.07% 10 0.45% Preprocessor1_Model43
plot_result_tune_ranger(list(res_rand_for, res_rand_for_optim))

Terceira rodada de otimização

rand_for_grid_optim_2 <- grid_regular(
  mtry(range = c(10,35)),
  trees(range = c(350,500)),
  min_n(range = c(1, 3)),
  levels = c(mtry = 3, trees = 4, min_n = 3)
    
)



all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_rand_for_optim_2 <- wf_rand_for_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = rand_for_grid_optim_2,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
res_rand_for_optim_2 <- read_rds("resultados/res_rand_for_optim_2.rds" )

collect_metrics(res_rand_for_optim_2) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
10 500 1 roc_auc binary 88.31% 10 0.42% Preprocessor1_Model10
22 500 1 roc_auc binary 88.30% 10 0.42% Preprocessor1_Model11
10 500 2 roc_auc binary 88.26% 10 0.42% Preprocessor1_Model22
22 350 1 roc_auc binary 88.24% 10 0.44% Preprocessor1_Model02
10 450 1 roc_auc binary 88.24% 10 0.43% Preprocessor1_Model07
22 450 1 roc_auc binary 88.24% 10 0.47% Preprocessor1_Model08
22 400 1 roc_auc binary 88.24% 10 0.48% Preprocessor1_Model05
10 400 2 roc_auc binary 88.24% 10 0.43% Preprocessor1_Model16
22 500 3 roc_auc binary 88.22% 10 0.45% Preprocessor1_Model35
22 500 2 roc_auc binary 88.21% 10 0.47% Preprocessor1_Model23
plot_result_tune_ranger(list(res_rand_for, res_rand_for_optim, res_rand_for_optim_2 ))

Quarta rodada de otimização

rand_for_grid_optim_3 <- grid_regular(
  mtry(range = c(8, 22)),
  trees(range = c(400,500)),
  min_n(range = c(1,1)),
  levels = c(mtry = 5, trees = 5, min_n = 1)
    
)

rand_for_grid_optim_3
## # A tibble: 25 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     8   400     1
##  2    11   400     1
##  3    15   400     1
##  4    18   400     1
##  5    22   400     1
##  6     8   425     1
##  7    11   425     1
##  8    15   425     1
##  9    18   425     1
## 10    22   425     1
## # ... with 15 more rows
res_rand_for_optim_3 <- read_rds("resultados/res_rand_for_optim_3.rds" )

collect_metrics(res_rand_for_optim_3) %>% arrange(desc(mean)) %>% 
  filter(
    .metric == "roc_auc"
  ) %>% 
  head(
    n = 10
  ) %>% 
  gt() %>% 
  fmt_percent(
    columns = vars(mean, std_err)
  ) 
mtry trees min_n .metric .estimator mean n std_err .config
22 475 1 roc_auc binary 88.25% 10 0.42% Preprocessor1_Model20
18 500 1 roc_auc binary 88.25% 10 0.46% Preprocessor1_Model24
15 450 1 roc_auc binary 88.24% 10 0.45% Preprocessor1_Model13
15 500 1 roc_auc binary 88.24% 10 0.45% Preprocessor1_Model23
18 475 1 roc_auc binary 88.23% 10 0.44% Preprocessor1_Model19
11 450 1 roc_auc binary 88.22% 10 0.42% Preprocessor1_Model12
15 425 1 roc_auc binary 88.22% 10 0.44% Preprocessor1_Model08
11 500 1 roc_auc binary 88.22% 10 0.44% Preprocessor1_Model22
18 425 1 roc_auc binary 88.22% 10 0.46% Preprocessor1_Model09
18 400 1 roc_auc binary 88.22% 10 0.45% Preprocessor1_Model04
plot_result_tune_ranger(
  list(
    res_rand_for,
    res_rand_for_optim,
    res_rand_for_optim_2,
    res_rand_for_optim_3
  )
)

Redes neurais

Apesar do hype que envolve as redes neurais, que as faz parecer mágicas e misteriosas, elas são métodos não lineares em que regressões lineares são aplicadas sobre as saídas de outras regressões. Algumas dessas saídas são transformadas por funções de ativação não lineares, o que dá o caráter não linear ao método. Os coeficientes dessas regressões, também chamados de pesos, são calibrados de forma a minimizar o erro por meio de um algoritmo engenhoso chamado backpropagation.
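Um esboço didático, com valores hipotéticos, de um único "neurônio": uma combinação linear das entradas seguida de uma função de ativação sigmoide.

```r
# Combinação linear das entradas seguida de ativação não linear.
sigmoide <- function(z) 1 / (1 + exp(-z))

x <- c(1.0, 2.0, 3.0)    # entradas (valores hipotéticos)
w <- c(0.5, -0.3, 0.1)   # pesos: é isso que o backpropagation calibra
b <- 0.2                 # viés

saida <- sigmoide(sum(w * x) + b)
saida  # ~0.599
```

Uma rede completa encadeia muitos desses neurônios, e o backpropagation ajusta todos os pesos ao mesmo tempo.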

Uma arquitetura de rede neural

Existem várias arquiteturas de redes neurais.

Vamos usar a arquitetura mais simples e mais usada: single hidden layer.

Como mostra a figura à esquerda, esse modelo tem uma camada de entrada, uma camada “escondida” intermediária e uma camada de saída.

\(p\) é o número de entradas, \(K\) o número de categorias de saída possíveis (no caso de classificação, no caso de regressão há uma saída) e \(M\), o número de neurônios na camada escondida.
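Com essas definições podemos contar os pesos da rede: são \((p + 1) \times M\) pesos entre a entrada e a camada escondida e \((M + 1) \times K\) entre a camada escondida e a saída (o \(+1\) corresponde ao viés). Para classificação binária, o nnet usa uma única saída (\(K = 1\)). Um esboço em R:

```r
# Número de pesos de uma rede com uma camada escondida
n_pesos <- function(p, M, K) (p + 1) * M + (M + 1) * K

# A rede 3-10-1 ajustada pelo nnet mais adiante tem 51 pesos:
n_pesos(p = 3, M = 10, K = 1)  # 51
```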

O nome neurônio vem de uma simplificação de como funciona a célula nervosa. Assim como os nós da rede neural, a célula também recebe várias entradas e tem um processo de ativação (liga ou desliga) dependendo da intensidade da informação que recebe nas entradas.

No entanto, o cérebro humano é uma rede extremamente complexa, com uma arquitetura que foi selecionada durante bilhões de anos.

Existem arquiteturas de rede neural mais complicadas, mas o número de pesos a serem calibrados é explosivo. Elas exigem uma quantidade massiva de dados. A área que estuda essas redes profundas se chama deep learning.

Um exemplo de rede neural com 3 entradas

Vamos fazer o mesmo exercício que já fizemos com os outros modelos, treinando-a com apenas duas features contínuas e uma categórica.

set.seed(555)


receita_nnet_demo <- recipe(
  i_liked_partner ~
    i_found_partner__attractive +
    sex +
    my_age,
   data = dado_treino
) %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  

nnet_mod <- 
  mlp(
    hidden_units = 10,
    epochs = 1000
  ) %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

wf_nnet_demo <- workflow() %>% 
  add_recipe(receita_nnet_demo) %>% 
  add_model(nnet_mod)

fit_nnet_demo <- 
  wf_nnet_demo %>% 
  fit(
    data = dado_treino
  )
  
fit_nnet_demo
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: mlp()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_center()
## * step_scale()
## 
## -- Model -----------------------------------------------------------------------
## a 3-10-1 network with 51 weights
## inputs: i_found_partner__attractive sexHomem my_age 
## output(s): ..y 
## options were - entropy fitting

Podemos perceber que o modelo cria um padrão de classificação complexo, que consegue considerar a interação entre as entradas.

predicoes_nnet <- predict(
  object = fit_nnet_demo,
  new_data = dados_novos
)


predicoes_nnet_com_dados <- bind_cols(
  dados_novos,
  predicoes_nnet
) 



ggplot(predicoes_nnet_com_dados) +
  geom_tile(
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      fill = .pred_class,
      alpha = 0.1
    )
  ) +
  geom_jitter(
    data = dado_treino,
    width = 0.5,
    aes(
      x = i_found_partner__attractive,
      y = my_age,
      color = i_liked_partner
    ),
    size = 0.8,
    show.legend = FALSE
  ) +
  facet_wrap(
    ~sex
  ) +
  guides(
    alpha = FALSE
  ) +
  theme_minimal() +
  scale_fill_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  scale_color_manual(
    values = c(Liked = "darkgreen", Not = "darkred")
  ) +
  theme(
    legend.position = "top"
  ) +
  labs(
    fill = "",
    x = "Achei bonito",
    y = "Minha idade"
  )

Fazendo tuning na rede neural

hidden_units é o número de neurônios na camada escondida. Quanto maior o número, maior a flexibilidade da rede.

penalty é uma penalidade similar à da regressão do tipo ridge: limita a magnitude dos pesos.

epochs é o número de iterações usadas para treinar a rede. Quanto mais iterações, mais adaptada ao conjunto de treinamento a rede estará.
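A título de ilustração, o penalty (weight decay no nnet) soma à função de custo um termo proporcional à soma dos quadrados dos pesos; pesos grandes ficam "caros" e a rede perde flexibilidade. Um esboço com valores hipotéticos:

```r
# custo = entropia cruzada + penalty * soma(pesos^2)
custo_penalizado <- function(entropia, pesos, penalty) {
  entropia + penalty * sum(pesos^2)
}

pesos <- c(2, -1.5, 0.5)
custo_penalizado(entropia = 0.40, pesos = pesos, penalty = 0.1)
# 0.40 + 0.1 * 6.5 = 1.05
```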

nnet_grid <- grid_regular(
  hidden_units(),
  penalty(),
  epochs(),
  levels = 3
)

nnet_grid
## # A tibble: 27 x 3
##    hidden_units      penalty epochs
##           <int>        <dbl>  <int>
##  1            1 0.0000000001     10
##  2            5 0.0000000001     10
##  3           10 0.0000000001     10
##  4            1 0.00001          10
##  5            5 0.00001          10
##  6           10 0.00001          10
##  7            1 1                10
##  8            5 1                10
##  9           10 1                10
## 10            1 0.0000000001    505
## # ... with 17 more rows
receita_com_interacao_class_center_scale <- receita_com_interacao_class %>% 
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
  

tune_spec_nnet <- parsnip::mlp(
  
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()

) %>% 
  set_engine(
    engine = "nnet",
    MaxNWts = 3000
  ) %>% 
  set_mode(
    "classification"
  )


wf_nnet_tune_sample <- workflow() %>% 
  add_model(
    tune_spec_nnet
  ) %>% 
  add_recipe(receita_com_interacao_class_center_scale)


all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 



write_rds(res_nnet, "resultados/res_nnet.rds")
res_nnet <- read_rds("resultados/res_nnet.rds")

collect_metrics(res_nnet) %>% arrange(desc(mean))
## # A tibble: 54 x 9
##    hidden_units   penalty epochs .metric .estimator  mean     n std_err .config 
##           <int>     <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   
##  1            1  1.00e+ 0   1000 roc_auc binary     0.847    10 0.00281 Preproc~
##  2            1  1.00e+ 0    505 roc_auc binary     0.847    10 0.00251 Preproc~
##  3           10  1.00e+ 0     10 roc_auc binary     0.836    10 0.00390 Preproc~
##  4           10  1.00e+ 0    505 roc_auc binary     0.836    10 0.00354 Preproc~
##  5           10  1.00e-10     10 roc_auc binary     0.834    10 0.00352 Preproc~
##  6           10  1.00e+ 0   1000 roc_auc binary     0.831    10 0.00525 Preproc~
##  7            5  1.00e+ 0     10 roc_auc binary     0.828    10 0.00460 Preproc~
##  8            5  1.00e+ 0    505 roc_auc binary     0.828    10 0.00405 Preproc~
##  9           10  1.00e- 5     10 roc_auc binary     0.825    10 0.00558 Preproc~
## 10            1  1.00e+ 0     10 roc_auc binary     0.823    10 0.00557 Preproc~
## # ... with 44 more rows
plot_result_tune_nnet <- function(results){
  

  results %>% 
    map_df(
      .f = collect_metrics
    ) %>% 
    filter(
      .metric == "roc_auc"
    ) %>% 
    group_by(
      penalty, epochs, hidden_units
    ) %>% 
    summarise(
      mean = mean(mean)
    ) %>% 
    ungroup() %>% 
    mutate(
      ranque = rank(mean)
    ) %>% 
    ggplot() +
    geom_tile(
      aes(
        x = penalty ,
        y = epochs %>%  factor(),
        fill = ranque
      )
    ) +
    scale_x_log10() +
    geom_shadowtext(
      aes(
        x = penalty,
        y = epochs %>%  factor(),
        label = percent(mean, accuracy = .01),
      ),
      color = "white",
      bg.colour="black",
      size = 3
    ) +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    facet_wrap(
      ~hidden_units, 
      ncol = 1,
      labeller = as_labeller( function(x){str_glue("hidden: {x}")}  )
    ) +
    theme_minimal() +
    theme(
      legend.position = "top"
    )
}


plot_result_tune_nnet(list(res_nnet))

Fazendo tuning na rede neural, segunda rodada

nnet_grid_2 <- grid_regular(
  hidden_units(range = c(1,10)),
  penalty( range = c(-1,1)),
  epochs(range = c(10, 1000)),
  levels = c(hidden_units = 3, penalty = 2, epochs = 3)
)

nnet_grid_2
## # A tibble: 18 x 3
##    hidden_units penalty epochs
##           <int>   <dbl>  <int>
##  1            1     0.1     10
##  2            5     0.1     10
##  3           10     0.1     10
##  4            1    10       10
##  5            5    10       10
##  6           10    10       10
##  7            1     0.1    505
##  8            5     0.1    505
##  9           10     0.1    505
## 10            1    10      505
## 11            5    10      505
## 12           10    10      505
## 13            1     0.1   1000
## 14            5     0.1   1000
## 15           10     0.1   1000
## 16            1    10     1000
## 17            5    10     1000
## 18           10    10     1000
all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_2 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_2,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_2, "resultados/res_nnet_2.rds")
res_nnet_2 <- read_rds("resultados/res_nnet_2.rds")


collect_metrics(res_nnet_2) %>% arrange(desc(mean))
## # A tibble: 36 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1           10    10     1000 roc_auc binary     0.859    10 0.00308 Preproces~
##  2           10    10      505 roc_auc binary     0.859    10 0.00317 Preproces~
##  3            5    10     1000 roc_auc binary     0.858    10 0.00302 Preproces~
##  4            5    10      505 roc_auc binary     0.858    10 0.00342 Preproces~
##  5            1    10      505 roc_auc binary     0.849    10 0.00320 Preproces~
##  6            1    10     1000 roc_auc binary     0.848    10 0.00314 Preproces~
##  7           10    10       10 roc_auc binary     0.842    10 0.00455 Preproces~
##  8            5    10       10 roc_auc binary     0.838    10 0.00274 Preproces~
##  9           10     0.1     10 roc_auc binary     0.837    10 0.00372 Preproces~
## 10            1     0.1    505 roc_auc binary     0.832    10 0.00159 Preproces~
## # ... with 26 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet))

nnet_grid_3 <- grid_regular(
  hidden_units(range = c(1,15)),
  penalty( range = c(0,1)),
  epochs(range = c(1500, 2000)),
  levels = c(hidden_units = 4, penalty = 2, epochs = 2)
)





all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_3 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_3,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_3, "resultados/res_nnet_3.rds")
res_nnet_3 <- read_rds("resultados/res_nnet_3.rds")


collect_metrics(res_nnet_3) %>% arrange(desc(mean))
## # A tibble: 32 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1            5      10   2000 roc_auc binary     0.859    10 0.00288 Preproces~
##  2           10      10   2000 roc_auc binary     0.859    10 0.00300 Preproces~
##  3           10      10   1500 roc_auc binary     0.859    10 0.00339 Preproces~
##  4           15      10   2000 roc_auc binary     0.859    10 0.00308 Preproces~
##  5           15      10   1500 roc_auc binary     0.858    10 0.00318 Preproces~
##  6            5      10   1500 roc_auc binary     0.858    10 0.00347 Preproces~
##  7            1      10   1500 roc_auc binary     0.849    10 0.00314 Preproces~
##  8            1      10   2000 roc_auc binary     0.848    10 0.00318 Preproces~
##  9           15       1   2000 roc_auc binary     0.847    10 0.00304 Preproces~
## 10            1       1   1500 roc_auc binary     0.846    10 0.00284 Preproces~
## # ... with 22 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet, res_nnet_3))

nnet_grid_4 <- grid_regular(  
  hidden_units(range = c(10,15)),
  penalty( range = c(1,2)),
  epochs(range = c(2500, 3500)),
  levels = c(hidden_units = 2, penalty = 2, epochs = 3)
)



all_cores <- parallel::detectCores(logical = FALSE)

library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


res_nnet_4 <- wf_nnet_tune_sample %>% 
  tune_grid(
    resamples = folds,
    grid = nnet_grid_4,
    control = control_grid(allow_par = TRUE, verbose = TRUE)
  )  

stopCluster(cl)
 
write_rds(res_nnet_4, "resultados/res_nnet_4.rds")
res_nnet_4 <- read_rds("resultados/res_nnet_4.rds")


collect_metrics(res_nnet_4) %>% arrange(desc(mean))
## # A tibble: 24 x 9
##    hidden_units penalty epochs .metric .estimator  mean     n std_err .config   
##           <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>     
##  1           10      10   2500 roc_auc binary     0.859    10 0.00328 Preproces~
##  2           15      10   2500 roc_auc binary     0.859    10 0.00323 Preproces~
##  3           10      10   3000 roc_auc binary     0.858    10 0.00319 Preproces~
##  4           15      10   3500 roc_auc binary     0.858    10 0.00306 Preproces~
##  5           15      10   3000 roc_auc binary     0.858    10 0.00314 Preproces~
##  6           10      10   3500 roc_auc binary     0.858    10 0.00308 Preproces~
##  7           10     100   3000 roc_auc binary     0.834    10 0.00428 Preproces~
##  8           10     100   2500 roc_auc binary     0.834    10 0.00428 Preproces~
##  9           10     100   3500 roc_auc binary     0.834    10 0.00427 Preproces~
## 10           15     100   2500 roc_auc binary     0.834    10 0.00426 Preproces~
## # ... with 14 more rows
plot_result_tune_nnet(list(res_nnet_2, res_nnet, res_nnet_3, res_nnet_4))

Validating the model: selecting the best model

We have already experimented with several models and tuned their hyperparameters, testing them on validation data that was always kept out of the training set.

Our best model was one of the random forest configurations.
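To compare the candidates side by side before committing to one, an option is show_best() on each tuning result. A sketch, assuming the tuning objects created earlier (res_rand_for_optim_2 for the random forest, res_nnet_4 for the neural network):

```r
# Top 3 configurations of each tuned model, ranked by validation roc_auc
show_best(res_rand_for_optim_2, metric = "roc_auc", n = 3)
show_best(res_nnet_4, metric = "roc_auc", n = 3)
```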

We select that model with select_best()

best_ranger <- 
  select_best(
    res_rand_for_optim_2  
  )

best_ranger
## # A tibble: 1 x 4
##    mtry trees min_n .config              
##   <int> <int> <int> <chr>                
## 1    10   500     1 Preprocessor1_Model10

We then create a workflow based on the one we were already using, adding the information that this is the chosen model

final_wf_ranger <- wf_rand_for_tune_sample %>% 
  finalize_workflow(best_ranger)

final_wf_ranger
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 10
##   trees = 500
##   min_n = 1
## 
## Computational engine: ranger

Validating the model: training the model on all of the training/validation data

final_ranger <- 
  final_wf_ranger %>%
  fit(data = dado_treino) 


final_ranger
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
## 
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
## 
## -- Model -----------------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10L,      x), num.trees = ~500L, min.node.size = min_rows(~1L, x),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      3664 
## Number of independent variables:  164 
## Mtry:                             10 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1446971
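The last line of the output reports the out-of-bag (OOB) Brier score. The Brier score is the mean squared difference between the predicted probabilities and the observed 0/1 outcomes, so lower is better. A minimal illustration with made-up numbers (not model output):

```r
prob <- c(0.9, 0.2, 0.7, 0.4)  # predicted probabilities of "Liked"
obs  <- c(1,   0,   1,   0)    # observed outcomes (1 = Liked)
mean((prob - obs)^2)           # Brier score
## [1] 0.075
```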

Validating the model: evaluating performance on the test set

all_cores <- parallel::detectCores(logical = FALSE)


cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)


final_fit_ranger <- 
  final_wf_ranger %>% 
  last_fit(split_dado)

stopCluster(cl)

write_rds(final_fit_ranger, "resultados/final_fit_ranger.rds")

We can see that the model generalized well to the test data. Good news.

final_fit_ranger <- read_rds("resultados/final_fit_ranger.rds")


final_fit_ranger %>% collect_metrics() %>% 
  gt() %>% 
  fmt_percent(
    columns = .estimate
  )
.metric  .estimator .estimate .config             
accuracy binary     80.84%    Preprocessor1_Model1
roc_auc  binary     89.26%    Preprocessor1_Model1
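To quantify how well the model generalized, we can compare the validation roc_auc of the selected configuration with the test roc_auc above. A sketch, assuming the objects created earlier:

```r
# Mean roc_auc over the resamples for the chosen configuration
show_best(res_rand_for_optim_2, metric = "roc_auc", n = 1)$mean

# roc_auc on the held-out test set
collect_metrics(final_fit_ranger) %>% 
  filter(.metric == "roc_auc") %>% 
  pull(.estimate)
```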

Performance on the test data: ROC curve

curva <- final_fit_ranger %>% 
  collect_predictions() %>% 
  roc_curve(
    truth = i_liked_partner,
    .pred_Liked
  )

# Point on the curve with the smallest threshold above 0.45
ponto_gatilho <- curva %>% 
  filter(
    .threshold > 0.45 
  ) %>% 
  slice_min(
    n = 1, order_by = .threshold
  )
  

  
curva %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_point(
    data = ponto_gatilho,
    aes(x = 1 - specificity, y = sensitivity),
    size = 3,
    color = "darkblue"
  ) +
  geom_text_repel(
    data = ponto_gatilho,
    aes(
      x = 1 - specificity + 0.15, 
      y = sensitivity - 0.15,
      label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>%  percent(accuracy = 0.1)}")
    )
  ) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()
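The highlighted point corresponds to a 0.45 probability cutoff. To see the same trade-off as counts rather than rates, we can classify with that cutoff and build a confusion matrix. A sketch; the negative level name below is a guess, so check the actual labels with levels() on i_liked_partner:

```r
final_fit_ranger %>% 
  collect_predictions() %>% 
  mutate(
    # classify with the 0.45 cutoff highlighted on the curve;
    # the level names must match levels(i_liked_partner)
    .pred_cutoff = if_else(.pred_Liked > 0.45, "Liked", "Not liked") %>% 
      factor(levels = levels(i_liked_partner))
  ) %>% 
  conf_mat(truth = i_liked_partner, estimate = .pred_cutoff)
```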